* Fwd: dup vs raid1 in single disk
[not found] <CACNDjuzntG5Saq5HHNeDUmq-=28riKAerkO=CD=zAW-QofbKSg@mail.gmail.com>
@ 2017-01-19 16:39 ` Alejandro R. Mosteo
2017-01-19 17:06 ` Austin S. Hemmelgarn
2017-01-19 18:23 ` Roman Mamedov
0 siblings, 2 replies; 10+ messages in thread
From: Alejandro R. Mosteo @ 2017-01-19 16:39 UTC (permalink / raw)
To: linux-btrfs
Hello list,
I was wondering, from a data-safety point of view, whether there is any
difference between using dup and making a raid1 from two partitions on
the same disk. The idea is to have some protection against the typical
aging HDD that starts to develop bad sectors.
On a related note, I see this caveat about dup in the manpage:
"For example, a SSD drive can remap the blocks internally to a single
copy thus deduplicating them. This negates the purpose of increased
redunancy (sic) and just wastes space"
SSDs' failure modes are different (more of an all-or-nothing thing, I'm
told), so it wouldn't apply to the use case above, but for curiosity's
sake I'd like to know whether there would be any difference there too.
Thanks,
Alex.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Fwd: dup vs raid1 in single disk
2017-01-19 16:39 ` Fwd: dup vs raid1 in single disk Alejandro R. Mosteo
@ 2017-01-19 17:06 ` Austin S. Hemmelgarn
2017-01-19 18:23 ` Roman Mamedov
1 sibling, 0 replies; 10+ messages in thread
From: Austin S. Hemmelgarn @ 2017-01-19 17:06 UTC (permalink / raw)
To: Alejandro R. Mosteo, linux-btrfs
On 2017-01-19 11:39, Alejandro R. Mosteo wrote:
> Hello list,
>
> I was wondering, from a point of view of data safety, if there is any
> difference between using dup or making a raid1 from two partitions in
> the same disk. This is thinking on having some protection against the
> typical aging HDD that starts to have bad sectors.
>
> On a related note, I see this caveat about dup in the manpage:
>
> "For example, a SSD drive can remap the blocks internally to a single
> copy thus deduplicating them. This negates the purpose of increased
> redunancy (sic) and just wastes space"
>
> SSDs failure modes are different (more an all or nothing thing, I'm
> told) so it wouldn't apply to the use case above, but I'm curious for
> curiosity's sake if there would be any difference too.
On a traditional HDD, there actually is a reasonable safety benefit to
using 2 partitions in raid1 mode over using dup mode. This is because
most traditional HDD firmware still keeps the mapping of physical
sectors to logical sectors mostly linear, so having separate partitions
will (usually) mean that the two copies are not located near each other
on physical media. A similar but weaker version of the same effect can
be achieved by using the 'ssd_spread' mount option, but I would not
suggest relying on that. This doesn't apply to hybrid drives (because
they move stuff around however they want like SSD's), or SMR drives
(because they rewrite large portions of the disk when one place gets
rewritten, so physical separation of the data copies doesn't get you as
much protection).
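The effect of physical separation can be illustrated with a toy Monte Carlo model (assumed geometry, not real btrfs allocation): a single contiguous media defect often takes out both copies when they sit close together, and essentially never when they are half a disk apart.

```python
import random

def both_lost(separation, defect_size, trials=100_000, seed=42):
    """Probability that one contiguous media defect of `defect_size`
    (as a fraction of the disk) destroys both copies of a block whose
    copies are `separation` apart on the platter. Toy model only."""
    rng = random.Random(seed)
    lost = 0
    for _ in range(trials):
        a = rng.random()                 # first copy's position
        b = (a + separation) % 1.0       # second copy's position
        start = rng.random()             # defect start
        end = start + defect_size        # defect is [start, end), may wrap
        def hit(x):
            return start <= x < end or x + 1.0 < end
        if hit(a) and hit(b):
            lost += 1
    return lost / trials

# dup-like placement: copies ~0.1% of the disk apart
print(both_lost(0.001, 0.01))  # ~0.009: one defect often spans both copies
# raid1-on-two-partitions: copies half a disk apart
print(both_lost(0.5, 0.01))    # 0.0: a small defect never spans both
```

The numbers (0.1% separation, 1% defect) are purely illustrative; the point is only the direction of the effect.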
For most SSD's, there is no practical benefit because the FTL in the SSD
firmware generally maps physical sectors to logical sectors in whatever
arbitrary way it wants, which is usually not going to be linear.
As far as failure modes on an SSD, you usually see one of two things
happen, either the whole disk starts acting odd (or stops working), or
individual blocks a few MB in size (which seem to move around the disk
as they get over-written) start behaving odd. The first case is the
firmware or primary electronics going bad, while the second is
individual erase blocks going bad. As a general rule, SSD's will run
longer as they're going bad than HDD's will, but in both cases you
should look at replacing the device once you start seeing the error
counters going up consistently over time (or if you see them suddenly
jump to a much higher number).
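The "replace once the error counters rise consistently or jump" rule could be sketched like this (the threshold and the idea of periodically sampling a SMART counter such as Reallocated_Sector_Ct are illustrative assumptions, not anything SMART itself defines):

```python
def should_replace(history, jump_threshold=10):
    """history: chronological raw values of a SMART error counter
    (e.g. Reallocated_Sector_Ct sampled daily). Flag the drive if the
    counter keeps creeping upward, or if it jumps sharply at any point."""
    deltas = [b - a for a, b in zip(history, history[1:])]
    sudden_jump = any(d >= jump_threshold for d in deltas)
    # "consistently over time": grew in at least half of the intervals
    creeping = (sum(1 for d in deltas if d > 0) >= len(deltas) / 2
                and history[-1] > history[0])
    return sudden_jump or creeping

print(should_replace([0, 0, 0, 0, 0]))     # False: healthy
print(should_replace([0, 1, 2, 2, 3, 4]))  # True: creeping upward
print(should_replace([0, 0, 0, 40, 40]))   # True: sudden jump
```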
* Re: dup vs raid1 in single disk
2017-01-19 16:39 ` Fwd: dup vs raid1 in single disk Alejandro R. Mosteo
2017-01-19 17:06 ` Austin S. Hemmelgarn
@ 2017-01-19 18:23 ` Roman Mamedov
2017-01-19 20:02 ` Austin S. Hemmelgarn
1 sibling, 1 reply; 10+ messages in thread
From: Roman Mamedov @ 2017-01-19 18:23 UTC (permalink / raw)
To: Alejandro R. Mosteo; +Cc: linux-btrfs
On Thu, 19 Jan 2017 17:39:37 +0100
"Alejandro R. Mosteo" <alejandro@mosteo.com> wrote:
> I was wondering, from a point of view of data safety, if there is any
> difference between using dup or making a raid1 from two partitions in
> the same disk. This is thinking on having some protection against the
> typical aging HDD that starts to have bad sectors.
RAID1 will write more slowly than DUP: any optimization meant to make
RAID1 devices work in parallel becomes a total performance disaster here,
because you end up writing to both partitions at the same time, turning
all linear writes into random ones, which are about two orders of
magnitude slower than linear writes on spinning hard drives. DUP
shouldn't have this issue, but it will still be half the speed of
single, since you are writing everything twice.
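The "two orders of magnitude" figure is easy to sanity-check with typical (assumed, not measured) numbers for a 7200rpm drive:

```python
# Back-of-the-envelope check: random vs linear write throughput on a
# spinning disk, using assumed but typical figures.
seek_ms = 8.0           # average seek time
rotational_ms = 4.17    # half a revolution at 7200rpm (8.33ms/rev)
linear_mb_s = 150.0     # sustained sequential throughput
io_kib = 4              # small random write

random_iops = 1000.0 / (seek_ms + rotational_ms)
random_mb_s = random_iops * io_kib / 1024.0
print(round(random_iops))                 # ~82 IOPS
print(round(random_mb_s, 2))              # ~0.32 MB/s
print(round(linear_mb_s / random_mb_s))   # ~467x: 2-3 orders of magnitude
```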
You could consider DUP data for when a disk is already known to be getting
bad sectors from time to time -- but then it's a fringe exercise to keep
using such a disk in the first place. With DUP data and DUP metadata you
can likely get some more life out of such a disk as throwaway storage for
non-essential data, at half capacity, but is it worth the effort, given
that it's likely to keep failing progressively worse over time?
In all other cases the performance and storage-space penalties of DUP
within a single device are way too great (and the gained redundancy too
low) compared to a proper system of single-profile data + backups, or a
RAID5/6 system (not Btrfs-based) + backups.
> On a related note, I see this caveat about dup in the manpage:
>
> "For example, a SSD drive can remap the blocks internally to a single
> copy thus deduplicating them. This negates the purpose of increased
> redunancy (sic) and just wastes space"
That ability is vastly overestimated in the man page. There is no miracle
content-addressable storage system working at 500 MB/sec speeds all within
a cheap little controller on SSDs. Likely most of what it can do is
compress simple stuff, such as runs of zeroes or other repeating byte
sequences.
And DUP mode is still useful on SSDs: if one copy of the DUP gets
corrupted in flight due to a bad controller, RAM, or cable, you can
restore that block from the copy whose CRC still matches.
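The recovery path described here can be sketched as follows (btrfs actually uses crc32c checksums stored in metadata; plain crc32 stands in here, and `read_with_dup` is a hypothetical helper, not a btrfs API):

```python
import zlib

def read_with_dup(copies, expected_crc):
    """Try each stored copy of a block and return the first whose
    checksum matches the one recorded in metadata."""
    for copy in copies:
        if zlib.crc32(copy) == expected_crc:
            return copy
    raise IOError("both copies corrupt: unrecoverable block")

block = b"important data" * 256
crc = zlib.crc32(block)
corrupted = b"\xff" + block[1:]   # one copy damaged in flight
print(read_with_dup([corrupted, block], crc) == block)  # True
```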
--
With respect,
Roman
* Re: dup vs raid1 in single disk
2017-01-19 18:23 ` Roman Mamedov
@ 2017-01-19 20:02 ` Austin S. Hemmelgarn
2017-01-21 16:00 ` Alejandro R. Mosteo
2017-02-07 22:28 ` Kai Krakow
0 siblings, 2 replies; 10+ messages in thread
From: Austin S. Hemmelgarn @ 2017-01-19 20:02 UTC (permalink / raw)
To: Roman Mamedov, Alejandro R. Mosteo; +Cc: linux-btrfs
On 2017-01-19 13:23, Roman Mamedov wrote:
> On Thu, 19 Jan 2017 17:39:37 +0100
> "Alejandro R. Mosteo" <alejandro@mosteo.com> wrote:
>
>> I was wondering, from a point of view of data safety, if there is any
>> difference between using dup or making a raid1 from two partitions in
>> the same disk. This is thinking on having some protection against the
>> typical aging HDD that starts to have bad sectors.
>
> RAID1 will write slower compared to DUP, as any optimization to make RAID1
> devices work in parallel will cause a total performance disaster for you as
> you will start trying to write to both partitions at the same time, turning
> all linear writes into random ones, which are about two orders of magnitude
> slower than linear on spinning hard drives. DUP shouldn't have this issue, but
> still it will be twice slower than single, since you are writing everything
> twice.
As of right now, there will actually be near zero impact on write
performance (or at least, far less than the theoretical 50%) because
there really isn't any optimization to speak of in the multi-device
code. That will hopefully change over time, but it's not likely to
happen any time soon, since nobody appears to be working on
multi-device write performance.
>
> You could consider DUP data for when a disk is already known to be getting bad
> sectors from time to time -- but then it's a fringe exercise to try and keep
> using such disk in the first place. Yeah with DUP data DUP metadata you can
> likely have some more life out of such disk as a throwaway storage space for
> non-essential data, at half capacity, but is it worth the effort, as it's
> likely to start failing progressively worse over time.
>
> In all other cases the performance and storage space penalty of DUP within a
> single device are way too great (and gained redundancy is too low) compared
> to a proper system of single profile data + backups, or a RAID5/6 system (not
> Btrfs-based) + backups.
That really depends on your usage. In my case, I run DUP data on single
disks regularly. I still do backups of course, but the performance
matters far less to me (especially in the cases where I'm using NVMe
SSD's, which measure throughput in thousands of MB/s for both reads and
writes) than the ability to recover from transient data corruption
without needing to go to a backup.
As long as /home and any other write heavy directories are on a separate
partition, I would actually advocate using DUP data on your root
filesystem if you can afford the space simply because it's a whole lot
easier to recover other data if the root filesystem still works. Most
of the root filesystem except some stuff under /var follows a WORM
access pattern, and even the stuff that doesn't in /var is usually not
performance critical, so the write performance penalty won't have
anywhere near as much impact on how well the system runs as you might think.
There's also the fact that you're writing more metadata than data most
of the time unless you're dealing with really big files, and metadata is
already DUP mode (unless you are using an SSD), so the performance hit
isn't 50%, it's actually a bit more than half the ratio of data writes
to metadata writes.
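Under the (illustrative) assumption that total bytes written are data plus twice metadata, the overhead of switching data from single to DUP works out as:

```python
def dup_data_overhead(data_bytes, metadata_bytes):
    """Fractional increase in total bytes written when data goes from
    single to DUP, given that metadata is already written twice."""
    before = data_bytes + 2 * metadata_bytes
    after = 2 * data_bytes + 2 * metadata_bytes
    return after / before - 1.0

# Assumed data:metadata write ratios, purely for illustration:
print(dup_data_overhead(1, 1))    # ~0.33: metadata-heavy, only +33%
print(dup_data_overhead(10, 1))   # ~0.83: data-heavy, approaching +100%
```

So the hit only approaches the full 2x when data writes dominate, which matches the point about really big files.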
>
>> On a related note, I see this caveat about dup in the manpage:
>>
>> "For example, a SSD drive can remap the blocks internally to a single
>> copy thus deduplicating them. This negates the purpose of increased
>> redunancy (sic) and just wastes space"
>
> That ability is vastly overestimated in the man page. There is no miracle
> content-addressable storage system working at 500 MB/sec speeds all within a
> little cheap controller on SSDs. Likely most of what it can do, is just
> compress simple stuff, such as runs of zeroes or other repeating byte
> sequences.
Most of those that do in-line compression don't implement it in
firmware, they implement it in hardware, and even DEFLATE can get 500
MB/second speeds if properly implemented in hardware. The firmware may
control how the hardware works, but it's usually hardware doing heavy
lifting in that case, and getting a good ASIC made that can hit the
required performance point for a reasonable compression algorithm like
LZ4 or Snappy is insanely cheap once you've gotten past the VLSI work.
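The "compress simple stuff like runs of zeroes" point is easy to demonstrate with DEFLATE itself (zlib in software here; a controller would do the equivalent in hardware):

```python
import os
import zlib

# Data a controller could trivially shrink vs. data it cannot:
zeros = b"\x00" * 65536
noise = os.urandom(65536)   # stands in for encrypted/already-compressed data

print(len(zlib.compress(zeros)))   # well under 1 KiB
print(len(zlib.compress(noise)))   # roughly the original 64 KiB
```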
>
> And the DUP mode is still useful on SSDs, for cases when one copy of the DUP
> gets corrupted in-flight due to a bad controller or RAM or cable, you could
> then restore that block from its good-CRC DUP copy.
The only window of time during which bad RAM could result in only one
copy of a block being bad is after the first copy is written but before
the second is, which is usually an insanely small amount of time. As
far as the cabling, the window for errors resulting in a single bad copy
of a block is pretty much the same as for RAM, and if they're
persistently bad, you're more likely to lose data for other reasons.
That said, I do still feel that DUP mode has value on SSD's. The
primary arguments against it are:
1. It wears out the SSD faster.
2. The blocks are likely to end up in the same erase block, and
therefore there will be no benefit.
The first argument is accurate, but not usually an issue for most
people. Average life expectancy for a decent SSD is well over 10 years,
which is more than twice the usual life expectancy for a consumer hard
drive. Putting it in further perspective, my 575GB SSD's have been
running essentially 24/7 for the past year and a half (13112 hours
powered on now), and have seen just short of 25.7TB of writes over that
time. This equates to roughly 2GB/hour, which is well within typical
desktop usage. It also means they've seen more than 44.5 times their
total capacity in writes. Despite this, the wear-out indicators all
show that I can still expect at least 9 years more of run-time on these.
Normalizing, that means I'm likely to see between 8 and 12 years
of life on these. Equivalent stats for the HDD's I used to use (NAS
rated Seagate drives) gave me a roughly 3-5 year life expectancy, less
than half that of the SSD. In both cases however, you're talking well
beyond the typical life expectancy of anything short of a server or a
tightly embedded system, and worrying about a 4-year versus 8-year life
expectancy on your storage device is kind of pointless when you need to
upgrade the rest of the system in 3 years.
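The arithmetic in the paragraph above holds together; a quick check (decimal TB/GB assumed, matching the figures as quoted):

```python
capacity_gb = 575     # drive capacity
hours = 13112         # powered-on hours
written_tb = 25.7     # total host writes

gb_per_hour = written_tb * 1000 / hours
full_drive_writes = written_tb * 1000 / capacity_gb
print(round(gb_per_hour, 1))        # 2.0 GB/hour
print(round(full_drive_writes, 1))  # ~44.7 full-capacity writes
```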
As far as the second argument against it, that one is partially correct,
but it ignores an important factor that many people who don't do hardware
design (and some who do) often overlook. The close temporal proximity
of the writes for each copy is likely to mean they end up in the same
erase block on the SSD (especially if the SSD has a large write cache).
However, that doesn't mean that one copy getting corrupted due to
device failure is guaranteed to corrupt the other. The reason for this
is exactly the same reason that single word errors in RAM are
exponentially more common than losing a whole chip or the whole memory
module: The primary error source is environmental noise (EMI, cosmic
rays, quantum interference, background radiation, etc), not system
failure. In other words, you're far more likely to lose a single cell
(which is usually not more than a single byte in the MLC flash that gets
used in most modern SSD's) in the erase block than the whole erase
block. In that event, you obviously have only got corruption in the
particular filesystem block that that particular cell was storing data for.
There's also a third argument for not using DUP on SSD's however:
The SSD already does most of the data integrity work itself.
This is only true of good SSD's, but many do have some degree of
built-in erasure coding in the firmware which can handle losing large
chunks of an erase block and still return the data safely. This is part
of the reason that you almost never see nice power-of-two sizes for
flash storage despite flash chips being made that way themselves (the
other part is the spare blocks). Depending on the degree of protection
provided by this erasure coding, it can actually cancel out my argument
against argument 2. In all practicality though, that requires you to
actually trust the SSD manufacturer to have implemented things properly
for it to be a valid counter-argument, and most people who would care
enough about data integrity to use BTRFS for that reason are not likely
to trust the storage device that much.
* Re: dup vs raid1 in single disk
2017-01-19 20:02 ` Austin S. Hemmelgarn
@ 2017-01-21 16:00 ` Alejandro R. Mosteo
2017-02-07 22:28 ` Kai Krakow
1 sibling, 0 replies; 10+ messages in thread
From: Alejandro R. Mosteo @ 2017-01-21 16:00 UTC (permalink / raw)
Cc: linux-btrfs
Thanks Austin and Roman for the interesting discussion.
Alex.
On 19/01/17 21:02, Austin S. Hemmelgarn wrote:
> On 2017-01-19 13:23, Roman Mamedov wrote:
>> On Thu, 19 Jan 2017 17:39:37 +0100
>> "Alejandro R. Mosteo" <alejandro@mosteo.com> wrote:
>>
>>> I was wondering, from a point of view of data safety, if there is any
>>> difference between using dup or making a raid1 from two partitions in
>>> the same disk. This is thinking on having some protection against the
>>> typical aging HDD that starts to have bad sectors.
>>
>> RAID1 will write slower compared to DUP, as any optimization to make
>> RAID1
>> devices work in parallel will cause a total performance disaster for
>> you as
>> you will start trying to write to both partitions at the same time,
>> turning
>> all linear writes into random ones, which are about two orders of
>> magnitude
>> slower than linear on spinning hard drives. DUP shouldn't have this
>> issue, but
>> still it will be twice slower than single, since you are writing
>> everything
>> twice.
> As of right now, there will actually be near zero impact on write
> performance (or at least, it's way less than the theoretical 50%)
> because there really isn't any optimization to speak of in the
> multi-device code. That will hopefully change over time, but it's not
> likely to do so any time in the future since nobody appears to be
> working on multi-device write performance.
>>
>> You could consider DUP data for when a disk is already known to be
>> getting bad
>> sectors from time to time -- but then it's a fringe exercise to try
>> and keep
>> using such disk in the first place. Yeah with DUP data DUP metadata
>> you can
>> likely have some more life out of such disk as a throwaway storage
>> space for
>> non-essential data, at half capacity, but is it worth the effort, as
>> it's
>> likely to start failing progressively worse over time.
>>
>> In all other cases the performance and storage space penalty of DUP
>> within a
>> single device are way too great (and gained redundancy is too low)
>> compared
>> to a proper system of single profile data + backups, or a RAID5/6
>> system (not
>> Btrfs-based) + backups.
> That really depends on your usage. In my case, I run DUP data on
> single disks regularly. I still do backups of course, but the
> performance is worth far less for me (especially in the cases where
> I'm using NVMe SSD's which have performance measured in thousands of
> MB/s for both reads and writes) than the ability to recover from
> transient data corruption without needing to go to a backup.
>
> As long as /home and any other write heavy directories are on a
> separate partition, I would actually advocate using DUP data on your
> root filesystem if you can afford the space simply because it's a
> whole lot easier to recover other data if the root filesystem still
> works. Most of the root filesystem except some stuff under /var
> follows a WORM access pattern, and even the stuff that doesn't in /var
> is usually not performance critical, so the write performance penalty
> won't have anywhere near as much impact on how well the system runs as
> you might think.
>
> There's also the fact that you're writing more metadata than data most
> of the time unless you're dealing with really big files, and metadata
> is already DUP mode (unless you are using an SSD), so the performance
> hit isn't 50%, it's actually a bit more than half the ratio of data
> writes to metadata writes.
>>
>>> On a related note, I see this caveat about dup in the manpage:
>>>
>>> "For example, a SSD drive can remap the blocks internally to a single
>>> copy thus deduplicating them. This negates the purpose of increased
>>> redunancy (sic) and just wastes space"
>>
>> That ability is vastly overestimated in the man page. There is no
>> miracle
>> content-addressable storage system working at 500 MB/sec speeds all
>> within a
>> little cheap controller on SSDs. Likely most of what it can do, is just
>> compress simple stuff, such as runs of zeroes or other repeating byte
>> sequences.
> Most of those that do in-line compression don't implement it in
> firmware, they implement it in hardware, and even DEFLATE can get 500
> MB/second speeds if properly implemented in hardware. The firmware
> may control how the hardware works, but it's usually hardware doing
> heavy lifting in that case, and getting a good ASIC made that can hit
> the required performance point for a reasonable compression algorithm
> like LZ4 or Snappy is insanely cheap once you've gotten past the VLSI
> work.
>>
>> And the DUP mode is still useful on SSDs, for cases when one copy of
>> the DUP
>> gets corrupted in-flight due to a bad controller or RAM or cable, you
>> could
>> then restore that block from its good-CRC DUP copy.
> The only window of time during which bad RAM could result in only one
> copy of a block being bad is after the first copy is written but
> before the second is, which is usually an insanely small amount of
> time. As far as the cabling, the window for errors resulting in a
> single bad copy of a block is pretty much the same as for RAM, and if
> they're persistently bad, you're more likely to lose data for other
> reasons.
>
> That said, I do still feel that DUP mode has value on SSD's. The
> primary arguments against it are:
> 1. It wears out the SSD faster.
> 2. The blocks are likely to end up in the same erase block, and
> therefore there will be no benefit.
>
> The first argument is accurate, but not usually an issue for most
> people. Average life expectancy for a decent SSD is well over 10
> years, which is more than twice the usual life expectancy for a
> consumer hard drive. Putting it in further perspective, the 575GB
> SSD's have been running essentially 24/7 for the past year and a half
> (13112 hours powered on now), and have seen just short of 25.7TB of
> writes over that time. This equates to roughly 2GB/hour, which is
> well within typical desktop usage. It also means they've seen more
> than 44.5 times their total capacity in writes. Despite this, the
> wear-out indicators all show that I can still expect at least 9 years
> more of run-time on these. Normalizing that, that means I'm likely to
> see between 8 and 12 years of life on these. Equivalent stats for the
> HDD's I used to use (NAS rated Seagate drives) gave me a roughly 3-5
> year life expectancy, less than half that of the SSD. In both cases
> however, you're talking well beyond the typical life expectancy of
> anything short of a server or a tight-embedded system, and worrying
> about a 4-year versus 8-year life expectancy on your storage device is
> kind of pointless when you need to upgrade the rest of the system in 3
> years.
>
> As far as the second argument against it, that one is partially
> correct, but ignores an important factor that many people who don't do
> hardware design (and some who do) don't often consider. The close
> temporal proximity of the writes for each copy are likely to mean they
> end up in the same erase block on the SSD (especially if the SSD has a
> large write cache). However, that doesn't mean that one getting
> corrupted due to device failure is guaranteed to corrupt the other.
> The reason for this is exactly the same reason that single word errors
> in RAM are exponentially more common than losing a whole chip or the
> whole memory module: The primary error source is environmental noise
> (EMI, cosmic rays, quantum interference, background radiation, etc),
> not system failure. In other words, you're far more likely to lose a
> single cell (which is usually not more than a single byte in the MLC
> flash that gets used in most modern SSD's) in the erase block than the
> whole erase block. In that event, you obviously have only got
> corruption in the particular filesystem block that that particular
> cell was storing data for.
>
> There's also a third argument for not using DUP on SSD's however:
> The SSD already does most of the data integrity work itself.
> This is only true of good SSD's, but many do have some degree of
> built-in erasure coding in the firmware which can handle losing large
> chunks of an erase block and still return the data safely. This is
> part of the reason that you almost never see nice power-of-two sizes
> for flash storage despite flash chips being made that way themselves
> (the other part is the spare blocks). Depending on the degree of
> protection provided by this erasure coding, it can actually cancel out
> my argument against argument 2. In all practicality though, that
> requires you to actually trust the SSD manufacturer to have
> implemented things properly for it to be a valid counter-argument, and
> most people who would care enough about data integrity to use BTRFS
> for that reason are not likely to trust the storage device that much.
* Re: dup vs raid1 in single disk
2017-01-19 20:02 ` Austin S. Hemmelgarn
2017-01-21 16:00 ` Alejandro R. Mosteo
@ 2017-02-07 22:28 ` Kai Krakow
2017-02-07 22:46 ` Hans van Kranenburg
` (3 more replies)
1 sibling, 4 replies; 10+ messages in thread
From: Kai Krakow @ 2017-02-07 22:28 UTC (permalink / raw)
To: linux-btrfs
Am Thu, 19 Jan 2017 15:02:14 -0500
schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
> On 2017-01-19 13:23, Roman Mamedov wrote:
> > On Thu, 19 Jan 2017 17:39:37 +0100
> > "Alejandro R. Mosteo" <alejandro@mosteo.com> wrote:
> >
> >> I was wondering, from a point of view of data safety, if there is
> >> any difference between using dup or making a raid1 from two
> >> partitions in the same disk. This is thinking on having some
> >> protection against the typical aging HDD that starts to have bad
> >> sectors.
> >
> > RAID1 will write slower compared to DUP, as any optimization to
> > make RAID1 devices work in parallel will cause a total performance
> > disaster for you as you will start trying to write to both
> > partitions at the same time, turning all linear writes into random
> > ones, which are about two orders of magnitude slower than linear on
> > spinning hard drives. DUP shouldn't have this issue, but still it
> > will be twice slower than single, since you are writing everything
> > twice.
> As of right now, there will actually be near zero impact on write
> performance (or at least, it's way less than the theoretical 50%)
> because there really isn't any optimization to speak of in the
> multi-device code. That will hopefully change over time, but it's
> not likely to do so any time in the future since nobody appears to be
> working on multi-device write performance.
I think that's only true if you don't account for the seek overhead. In
single-device RAID1 mode you will always seek half of the device while
writing data, and even when reading between odd and even PIDs. In
contrast, DUP mode doesn't guarantee shorter seeks, but statistically,
on average they should be shorter. So it should yield better performance
(though I wouldn't expect it to be observable, depending on your
workload).
So on devices with no seek overhead (i.e. SSDs) it is probably true
(minus bus-bandwidth considerations). For HDDs I'd prefer DUP.
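A toy model of that statistical point (assumed uniform placement, which real chunk allocation certainly isn't; real DUP copies usually land even closer together than this suggests):

```python
import random

rng = random.Random(0)
N = 100_000

# raid1 on two same-disk partitions: copy 2 sits at copy 1's offset plus
# half the device, so the seek between the two copies is always 0.5.
raid1_seek = 0.5
# dup (toy model): both copies land at independent uniform positions.
dup_seek = sum(abs(rng.random() - rng.random()) for _ in range(N)) / N

print(round(dup_seek, 3))     # ~0.333: a third of the device on average
print(dup_seek < raid1_seek)  # True: statistically shorter, as argued
```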
From a data-safety point of view: it's more likely that adjacent and
nearby sectors go bad together. So DUP imposes a higher risk of the
data being written only to bad sectors - which means data loss or even
filesystem loss (if metadata hits this problem).
To be realistic: I wouldn't trade space usage for duplicate data on an
already-failing disk, no matter whether it's DUP or RAID1. HDD space is
cheap, and such a setup is just a waste of performance AND space - no
matter what. I don't understand the purpose of this; it just results in
false safety.
Better get two separate devices at half the size. You have a good chance
of a better cost/space ratio anyway, plus better performance and safety.
> There's also the fact that you're writing more metadata than data
> most of the time unless you're dealing with really big files, and
> metadata is already DUP mode (unless you are using an SSD), so the
> performance hit isn't 50%, it's actually a bit more than half the
> ratio of data writes to metadata writes.
> >
> >> On a related note, I see this caveat about dup in the manpage:
> >>
> >> "For example, a SSD drive can remap the blocks internally to a
> >> single copy thus deduplicating them. This negates the purpose of
> >> increased redunancy (sic) and just wastes space"
> >
> > That ability is vastly overestimated in the man page. There is no
> > miracle content-addressable storage system working at 500 MB/sec
> > speeds all within a little cheap controller on SSDs. Likely most of
> > what it can do, is just compress simple stuff, such as runs of
> > zeroes or other repeating byte sequences.
> Most of those that do in-line compression don't implement it in
> firmware, they implement it in hardware, and even DEFLATE can get 500
> MB/second speeds if properly implemented in hardware. The firmware
> may control how the hardware works, but it's usually hardware doing
> heavy lifting in that case, and getting a good ASIC made that can hit
> the required performance point for a reasonable compression algorithm
> like LZ4 or Snappy is insanely cheap once you've gotten past the VLSI
> work.
I still think it's a myth... The overhead of managing inline
deduplication is just way too high to implement without jumping through
expensive hoops. Most workloads have almost zero deduplication
potential, and even when they do, the duplicate blocks appear so far
apart in time that an inline deduplicator won't catch them.
If it were all so easy, btrfs would already have it working in
mainline. I don't even remember whether those patches are still being
worked on.
With this in mind, I think dup metadata is still a good thing to have
even on SSDs, and I would always force-enable it.
Deduplication potential really only exists when using snapshots (which
are already deduplicated when taken) or when handling user data on a
file server in a multi-user environment. Users tend to copy their files
all over the place - multiple directories of multiple gigabytes. There
is also potential when you're working with client-machine backups or VM
images. I regularly see deduplication efficiency of 30-60% in such
scenarios - mostly on the file servers I manage. But because duplicate
blocks turn up far apart in time, only offline or nearline
deduplication works here.
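What such an offline deduplicator measures can be sketched like this (`dedup_ratio` is a hypothetical helper; real dedupers also verify candidate blocks byte-for-byte before sharing extents, since a hash match alone isn't proof):

```python
import hashlib

def dedup_ratio(data, block_size=4096):
    """Hash fixed-size blocks and report the fraction of blocks that
    could be freed by sharing identical ones."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    unique = {hashlib.sha256(b).digest() for b in blocks}
    return 1 - len(unique) / len(blocks)

# A user keeping three copies of the same 16 KiB file, plus 16 KiB of
# unrelated data (each block made distinct on purpose):
original = b"".join(bytes([i]) * 4096 for i in range(4))
unrelated = b"".join(bytes([100 + i]) * 4096 for i in range(4))
volume = original * 3 + unrelated
print(dedup_ratio(volume))   # 0.5: half of all blocks are duplicates
```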
> > And the DUP mode is still useful on SSDs, for cases when one copy
> > of the DUP gets corrupted in-flight due to a bad controller or RAM
> > or cable, you could then restore that block from its good-CRC DUP
> > copy.
> The only window of time during which bad RAM could result in only one
> copy of a block being bad is after the first copy is written but
> before the second is, which is usually an insanely small amount of
> time. As far as the cabling, the window for errors resulting in a
> single bad copy of a block is pretty much the same as for RAM, and if
> they're persistently bad, you're more likely to lose data for other
> reasons.
It depends on the design of the software. That's true if this memory
block is simply a single block throughout its lifetime in RAM before
being written to storage. But if it is already handled as a duplicate
block in memory, the odds are different. I hope btrfs is doing this
right... ;-)
> That said, I do still feel that DUP mode has value on SSD's. The
> primary arguments against it are:
> 1. It wears out the SSD faster.
I don't think this is a huge factor, even less when looking at the TBW
ratings of modern SSDs. And prices are low enough that it's better to
swap early than to wait for disaster to hit you. You can still use the
old SSD for archival storage (with caveats: don't leave it without
power for months or years!) or as a shock-resistant USB drive on the go.
> 2. The blocks are likely to end up in the same erase block, and
> therefore there will be no benefit.
Oh, this is probably a point to really think about... Would ssd_spread
help here?
> The first argument is accurate, but not usually an issue for most
> people. Average life expectancy for a decent SSD is well over 10
> years, which is more than twice the usual life expectancy for a
> consumer hard drive.
Well, my first SSD (128 GB) was worn out (according to SMART) after
only 12 months. Bigger drives wear much more slowly. I now have a 500
GB SSD, and SMART projects it will serve me well for the next 3-4
years or longer. But it will be worn out then - though I'm pretty sure
I'll get a new drive before then, for performance and space reasons.
My high usage pattern probably comes from using the drives for bcache
in write-back mode. Btrfs as the bcache user does its own job (because
of CoW) of pushing much more data through bcache than you'd normally
expect.
> As far as the second argument against it, that one is partially
> correct, but ignores an important factor that many people who don't
> do hardware design (and some who do) don't often consider. The close
> temporal proximity of the writes for each copy are likely to mean
> they end up in the same erase block on the SSD (especially if the SSD
> has a large write cache).
Deja vu...
> However, that doesn't mean that one
> getting corrupted due to device failure is guaranteed to corrupt the
> other. The reason for this is exactly the same reason that single
> word errors in RAM are exponentially more common than losing a whole
> chip or the whole memory module: The primary error source is
> environmental noise (EMI, cosmic rays, quantum interference,
> background radiation, etc), not system failure. In other words,
> you're far more likely to lose a single cell (which is usually not
> more than a single byte in the MLC flash that gets used in most
> modern SSD's) in the erase block than the whole erase block. In that
> event, you obviously have only got corruption in the particular
> filesystem block that that particular cell was storing data for.
Sounds reasonable...
> There's also a third argument for not using DUP on SSD's however:
> The SSD already does most of the data integrity work itself.
DUP is really not for integrity but for consistency. If one copy of
the block becomes damaged despite perfectly reasonable instructions
sent by the OS, that block still has perfect data integrity from the
drive firmware's perspective. But if it was the single copy of a
metadata block, your FS is probably toast now. In DUP mode you still
have the other copy, giving you consistent filesystem structures. With
that copy, the OS can restore filesystem integrity (which is levels
above block-level integrity).
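For what it's worth, forcing DUP metadata is a one-liner at mkfs time,
and a later scrub is what actually exercises the second copy. A rough
sketch (the device name and mount point are placeholders, not a tested
recipe):

```shell
# /dev/sdX and /mnt are placeholders.
mkfs.btrfs -m dup -d single /dev/sdX   # duplicate metadata, single data
mount /dev/sdX /mnt
btrfs scrub start -B /mnt   # verify checksums; repair from the good DUP copy
```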
--
Regards,
Kai
Replies to list-only preferred.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: dup vs raid1 in single disk
2017-02-07 22:28 ` Kai Krakow
@ 2017-02-07 22:46 ` Hans van Kranenburg
2017-02-08 0:39 ` Dan Mons
` (2 subsequent siblings)
3 siblings, 0 replies; 10+ messages in thread
From: Hans van Kranenburg @ 2017-02-07 22:46 UTC (permalink / raw)
To: Kai Krakow, linux-btrfs
On 02/07/2017 11:28 PM, Kai Krakow wrote:
> Am Thu, 19 Jan 2017 15:02:14 -0500
> schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>
>> On 2017-01-19 13:23, Roman Mamedov wrote:
>>> On Thu, 19 Jan 2017 17:39:37 +0100
>>> [...]
>>> And the DUP mode is still useful on SSDs, for cases when one copy
>>> of the DUP gets corrupted in-flight due to a bad controller or RAM
>>> or cable, you could then restore that block from its good-CRC DUP
>>> copy.
>> The only window of time during which bad RAM could result in only one
>> copy of a block being bad is after the first copy is written but
>> before the second is, which is usually an insanely small amount of
>> time. As far as the cabling, the window for errors resulting in a
>> single bad copy of a block is pretty much the same as for RAM, and if
>> they're persistently bad, you're more likely to lose data for other
>> reasons.
>
> It depends on the design of the software. You're true if this memory
> block is simply a single block throughout its lifetime in RAM before
> written to storage. But if it is already handled as duplicate block in
> memory, odds are different. I hope btrfs is doing this right... ;-)
In memory, it's just one copy, happily sitting around, getting corrupted
by cosmic rays and other stuff done to it by aliens, after which a valid
checksum is calculated for the corrupt data, after which it goes on its
way to disk, twice. Yay.
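A toy sketch of that failure mode (hypothetical file names, plain
cksum standing in for the btrfs CRC): the bit flip happens before the
checksum is computed, so both DUP copies land on disk looking valid:

```shell
# One logical block, a single copy in RAM.
data="hello world"
# A bit flip corrupts that single in-memory copy...
data="hellp world"
# ...and only then is the checksum computed, over the corrupt data.
sum=$(printf '%s' "$data" | cksum)
# DUP write: the same (bad) bytes go to disk twice.
printf '%s' "$data" > copy1
printf '%s' "$data" > copy2
# Both copies verify against the stored checksum, so a scrub sees
# nothing wrong.
[ "$(cksum < copy1)" = "$sum" ] && [ "$(cksum < copy2)" = "$sum" ] \
  && echo "both copies look valid"
```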
>> That said, I do still feel that DUP mode has value on SSD's. The
>> primary arguments against it are:
>> 1. It wears out the SSD faster.
>
> I don't think this is a huge factor, even more when looking at TBW
> capabilities of modern SSDs. And prices are low enough to better swap
> early than waiting for the disaster hitting you. Instead, you can still
> use the old SSD for archival storage (but this has drawbacks, don't
> leave them without power for months or years!) or as a shock resistent
> USB mobile drive on the go.
>
>> 2. The blocks are likely to end up in the same erase block, and
>> therefore there will be no benefit.
>
> Oh, this is probably a point to really think about... Would ssd_spread
> help here?
I think there was another one: SSD firmware deduplicating writes,
converting the DUP back into single copies and giving a false sense of
it being DUP. This one can be mitigated by e.g. using disk encryption,
which makes identical writes show up as different data on disk.
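As a sketch (device and mapping names are placeholders), that means
putting btrfs on top of dm-crypt, so the drive firmware only ever sees
ciphertext:

```shell
# Placeholders throughout; the idea, not a tested recipe.
cryptsetup luksFormat /dev/sdX           # plaintext blocks now map to unique ciphertext
cryptsetup open /dev/sdX cryptssd
mkfs.btrfs -m dup /dev/mapper/cryptssd   # DUP copies differ on flash, so no firmware dedup
```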
--
Hans van Kranenburg
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: dup vs raid1 in single disk
2017-02-07 22:28 ` Kai Krakow
2017-02-07 22:46 ` Hans van Kranenburg
@ 2017-02-08 0:39 ` Dan Mons
2017-02-08 9:14 ` Alejandro R. Mosteo
2017-02-08 13:02 ` Austin S. Hemmelgarn
3 siblings, 0 replies; 10+ messages in thread
From: Dan Mons @ 2017-02-08 0:39 UTC (permalink / raw)
To: Kai Krakow; +Cc: linux-btrfs
On 8 February 2017 at 08:28, Kai Krakow <hurikhan77@gmail.com> wrote:
> I still thinks it's a myth... The overhead of managing inline
> deduplication is just way too high to implement it without jumping
> through expensive hoops. Most workloads have almost zero deduplication
> potential. And even when, their temporal occurrence is spaced so far
> that an inline deduplicator won't catch it.
>
> If it would be all so easy, btrfs would already have it working in
> mainline. I don't even remember that those patches is still being
> worked on.
>
> With this in mind, I think dup metadata is still a good think to have
> even on SSD and I would always force to enable it.
>
> Potential for deduplication is only when using snapshots (which already
> are deduplicated when taken) or when handling user data on a file
> server in a multi-user environment. Users tend to copy their files all
> over the place - multiple directories of multiple gigabytes. Potential
> is also where you're working with client machine backups or vm images.
> I regularly see deduplication efficiency of 30-60% in such scenarios -
> file servers mostly which I'm handling. But due to temporally far
> spaced occurrence of duplicate blocks, only offline or nearline
> deduplication works here.
I'm a sysadmin by trade, managing many PB of storage for a media
company. Our primary storage is Oracle ZFS appliances, and all of our
secondary/nearline storage is Linux+BtrFS.
ZFS's inline deduplication is awful. It consumes enormous amounts of
RAM that is orders of magnitude more valuable as ARC/Cache, and
becomes immediately useless whenever a storage node is rebooted
(necessary to apply mandatory security patches) and the in-memory
tables are lost (meaning cold data is rarely re-examined, and the
inline dedup becomes less efficient).
Conversely, I use "duperemove" as a one-shot/offline deduplication
tool on all of our BtrFS storage. It can be run as a cron job outside
of business hours, and uses an SQLite database to store the necessary
dedup hash information on disk, rather than in RAM.
From the point of view of someone who manages large amounts of
long-term centralised storage, this is a far superior way to deal with
deduplication, as it offers more flexibility and far better
space-saving ratios at a lower memory cost.
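The setup Dan describes can be sketched roughly like this (the paths
are placeholders; `-r`, `-d` and `--hashfile` are real duperemove
options, but check the man page for your version):

```shell
# Hypothetical /etc/cron.d/duperemove - nightly offline dedup at 02:00,
# with the hash database kept on disk instead of in RAM.
0 2 * * * root duperemove -r -d --hashfile=/var/lib/duperemove/hashes /srv/storage
```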
We trialled ZFS dedup for a few months, and decided to turn it off, as
there was far less benefit to ZFS using all that RAM for dedup than
there was for it to be cache. I've been requesting Oracle offer a
similar offline dedup tool for their ZFS appliance for a very long
time, and if BtrFS ever did offer inline dedup, I wouldn't bother
using it for all of the reasons above.
-Dan
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: dup vs raid1 in single disk
2017-02-07 22:28 ` Kai Krakow
2017-02-07 22:46 ` Hans van Kranenburg
2017-02-08 0:39 ` Dan Mons
@ 2017-02-08 9:14 ` Alejandro R. Mosteo
2017-02-08 13:02 ` Austin S. Hemmelgarn
3 siblings, 0 replies; 10+ messages in thread
From: Alejandro R. Mosteo @ 2017-02-08 9:14 UTC (permalink / raw)
To: Kai Krakow, linux-btrfs
On 07/02/17 23:28, Kai Krakow wrote:
> To be realistic: I wouldn't trade space usage for duplicate data on an
> already failing disk, no matter if it's DUP or RAID1. HDD disk space is
> cheap, and using such a scenario is just waste of performance AND
> space - no matter what. I don't understand the purpose of this. It just
> results in fake safety.
The disk has already been replaced and is no longer my workstation's
main drive. I work with large datasets in my research, and I don't
care much about sustained I/O efficiency, since they're only read when
needed. Hence, it is a matter of squeezing the last life out of that
disk instead of discarding it right away. This way I have one extra
local store that may spare me a copy from a remote machine, so I
prefer to play with it until it dies. Besides, it affords me a chance
to play with btrfs/zfs in ways that I wouldn't normally risk, and I
can also assess their behavior with a truly failing disk.
In the end, after a destructive write pass with badblocks, the disk's
increasing uncorrectable sectors have disappeared... go figure. So
right now I have a btrfs filesystem built with the single profile on
top of four differently sized partitions. When/if bad blocks reappear
I'll test some raid configuration; probably raidz unless btrfs raid5
is somewhat usable by then (why settle for half a disk's worth when
you can have 2/3? ;-))
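For the record, the sequence described above amounts to something like
the following (device and partition names are placeholders; metadata
profile left at the mkfs default):

```shell
# Destructive read/write pass - wipes the disk and forces the drive to
# reallocate any sectors it can.
badblocks -wsv /dev/sdb
# Single data profile across four differently sized partitions of the
# same disk.
mkfs.btrfs -d single /dev/sdb1 /dev/sdb2 /dev/sdb3 /dev/sdb4
```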
Thanks for your justified concern though.
Alex.
> Better get two separate devices half the size. There's a better chance
> of getting a better cost/space ratio anyways, plus better performance
> and safety.
>
>> There's also the fact that you're writing more metadata than data
>> most of the time unless you're dealing with really big files, and
>> metadata is already DUP mode (unless you are using an SSD), so the
>> performance hit isn't 50%, it's actually a bit more than half the
>> ratio of data writes to metadata writes.
>>>
>>>> On a related note, I see this caveat about dup in the manpage:
>>>>
>>>> "For example, a SSD drive can remap the blocks internally to a
>>>> single copy thus deduplicating them. This negates the purpose of
>>>> increased redunancy (sic) and just wastes space"
>>> That ability is vastly overestimated in the man page. There is no
>>> miracle content-addressable storage system working at 500 MB/sec
>>> speeds all within a little cheap controller on SSDs. Likely most of
>>> what it can do, is just compress simple stuff, such as runs of
>>> zeroes or other repeating byte sequences.
>> Most of those that do in-line compression don't implement it in
>> firmware, they implement it in hardware, and even DEFLATE can get 500
>> MB/second speeds if properly implemented in hardware. The firmware
>> may control how the hardware works, but it's usually hardware doing
>> heavy lifting in that case, and getting a good ASIC made that can hit
>> the required performance point for a reasonable compression algorithm
>> like LZ4 or Snappy is insanely cheap once you've gotten past the VLSI
>> work.
> I still thinks it's a myth... The overhead of managing inline
> deduplication is just way too high to implement it without jumping
> through expensive hoops. Most workloads have almost zero deduplication
> potential. And even when, their temporal occurrence is spaced so far
> that an inline deduplicator won't catch it.
>
> If it would be all so easy, btrfs would already have it working in
> mainline. I don't even remember that those patches is still being
> worked on.
>
> With this in mind, I think dup metadata is still a good think to have
> even on SSD and I would always force to enable it.
>
> Potential for deduplication is only when using snapshots (which already
> are deduplicated when taken) or when handling user data on a file
> server in a multi-user environment. Users tend to copy their files all
> over the place - multiple directories of multiple gigabytes. Potential
> is also where you're working with client machine backups or vm images.
> I regularly see deduplication efficiency of 30-60% in such scenarios -
> file servers mostly which I'm handling. But due to temporally far
> spaced occurrence of duplicate blocks, only offline or nearline
> deduplication works here.
>
>>> And the DUP mode is still useful on SSDs, for cases when one copy
>>> of the DUP gets corrupted in-flight due to a bad controller or RAM
>>> or cable, you could then restore that block from its good-CRC DUP
>>> copy.
>> The only window of time during which bad RAM could result in only one
>> copy of a block being bad is after the first copy is written but
>> before the second is, which is usually an insanely small amount of
>> time. As far as the cabling, the window for errors resulting in a
>> single bad copy of a block is pretty much the same as for RAM, and if
>> they're persistently bad, you're more likely to lose data for other
>> reasons.
> It depends on the design of the software. You're true if this memory
> block is simply a single block throughout its lifetime in RAM before
> written to storage. But if it is already handled as duplicate block in
> memory, odds are different. I hope btrfs is doing this right... ;-)
>
>> That said, I do still feel that DUP mode has value on SSD's. The
>> primary arguments against it are:
>> 1. It wears out the SSD faster.
> I don't think this is a huge factor, even more when looking at TBW
> capabilities of modern SSDs. And prices are low enough to better swap
> early than waiting for the disaster hitting you. Instead, you can still
> use the old SSD for archival storage (but this has drawbacks, don't
> leave them without power for months or years!) or as a shock resistent
> USB mobile drive on the go.
>
>> 2. The blocks are likely to end up in the same erase block, and
>> therefore there will be no benefit.
> Oh, this is probably a point to really think about... Would ssd_spread
> help here?
>
>> The first argument is accurate, but not usually an issue for most
>> people. Average life expectancy for a decent SSD is well over 10
>> years, which is more than twice the usual life expectancy for a
>> consumer hard drive.
> Well, my first SSD (128 GB) was worn (according to SMART) after only 12
> months. Bigger drives wear much slower. I now have a 500 GB SSD and
> looking at SMART it projects to serve me well for the next 3-4 years
> or longer. But it will be worn out then. But I'm pretty sure I'll get a
> new drive until then - for performance and space reasons. My high usage
> pattern probably results from using the drives for bcache in write-back
> mode. Btrfs as the bcache user does it's own job (because of CoW) of
> pressing much more data through bcache than normal expectations.
>
>> As far as the second argument against it, that one is partially
>> correct, but ignores an important factor that many people who don't
>> do hardware design (and some who do) don't often consider. The close
>> temporal proximity of the writes for each copy are likely to mean
>> they end up in the same erase block on the SSD (especially if the SSD
>> has a large write cache).
> Deja vu...
>
>> However, that doesn't mean that one
>> getting corrupted due to device failure is guaranteed to corrupt the
>> other. The reason for this is exactly the same reason that single
>> word errors in RAM are exponentially more common than losing a whole
>> chip or the whole memory module: The primary error source is
>> environmental noise (EMI, cosmic rays, quantum interference,
>> background radiation, etc), not system failure. In other words,
>> you're far more likely to lose a single cell (which is usually not
>> more than a single byte in the MLC flash that gets used in most
>> modern SSD's) in the erase block than the whole erase block. In that
>> event, you obviously have only got corruption in the particular
>> filesystem block that that particular cell was storing data for.
> Sounds reasonable...
>
>> There's also a third argument for not using DUP on SSD's however:
>> The SSD already does most of the data integrity work itself.
> DUP is really not for integrity but for consistency. If one copy of the
> block becomes damaged for perfectly reasonable instructions sent by the
> OS (from the drive firmware perspective), that block has perfect data
> integrity. But if it was the single copy of a metadata block, your FS
> is probably toast now. In DUP mode you still have the other copy for
> consistent filesystem structures. With this copy, the OS can now restore
> filesystem integrity (which is levels above block level integrity).
>
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: dup vs raid1 in single disk
2017-02-07 22:28 ` Kai Krakow
` (2 preceding siblings ...)
2017-02-08 9:14 ` Alejandro R. Mosteo
@ 2017-02-08 13:02 ` Austin S. Hemmelgarn
3 siblings, 0 replies; 10+ messages in thread
From: Austin S. Hemmelgarn @ 2017-02-08 13:02 UTC (permalink / raw)
To: Kai Krakow, linux-btrfs
On 2017-02-07 17:28, Kai Krakow wrote:
> Am Thu, 19 Jan 2017 15:02:14 -0500
> schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>
>> On 2017-01-19 13:23, Roman Mamedov wrote:
>>> On Thu, 19 Jan 2017 17:39:37 +0100
>>> "Alejandro R. Mosteo" <alejandro@mosteo.com> wrote:
>>>
>>>> I was wondering, from a point of view of data safety, if there is
>>>> any difference between using dup or making a raid1 from two
>>>> partitions in the same disk. This is thinking on having some
>>>> protection against the typical aging HDD that starts to have bad
>>>> sectors.
>>>
>>> RAID1 will write slower compared to DUP, as any optimization to
>>> make RAID1 devices work in parallel will cause a total performance
>>> disaster for you as you will start trying to write to both
>>> partitions at the same time, turning all linear writes into random
>>> ones, which are about two orders of magnitude slower than linear on
>>> spinning hard drives. DUP shouldn't have this issue, but still it
>>> will be twice slower than single, since you are writing everything
>>> twice.
>> As of right now, there will actually be near zero impact on write
>> performance (or at least, it's way less than the theoretical 50%)
>> because there really isn't any optimization to speak of in the
>> multi-device code. That will hopefully change over time, but it's
>> not likely to do so any time in the future since nobody appears to be
>> working on multi-device write performance.
>
> I think that's only true if you don't account the seek overhead. In
> single device RAID1 mode you will always seek half of the device while
> writing data, and even when reading between odd and even PIDs. In
> contrast, DUP mode doesn't guarantee your seeks to be shorter but from
> a statistical point of view, on the average it should be shorter. So it
> should yield better performance (tho I wouldn't expect it to be
> observable, depending on your workload).
>
> So, on devices having no seek overhead (aka SSD), it is probably true
> (minus bus bandwidth considerations). For HDD I'd prefer DUP.
>
> From data safety point of view: It's more likely that adjacent
> and nearby sectors are bad. So DUP imposes a higher risk of written
> data being written to only bad sectors - which means data loss or even
> file system loss (if metadata hits this problem).
>
> To be realistic: I wouldn't trade space usage for duplicate data on an
> already failing disk, no matter if it's DUP or RAID1. HDD disk space is
> cheap, and using such a scenario is just waste of performance AND
> space - no matter what. I don't understand the purpose of this. It just
> results in fake safety.
>
> Better get two separate devices half the size. There's a better chance
> of getting a better cost/space ratio anyways, plus better performance
> and safety.
>
>> There's also the fact that you're writing more metadata than data
>> most of the time unless you're dealing with really big files, and
>> metadata is already DUP mode (unless you are using an SSD), so the
>> performance hit isn't 50%, it's actually a bit more than half the
>> ratio of data writes to metadata writes.
>>>
>>>> On a related note, I see this caveat about dup in the manpage:
>>>>
>>>> "For example, a SSD drive can remap the blocks internally to a
>>>> single copy thus deduplicating them. This negates the purpose of
>>>> increased redunancy (sic) and just wastes space"
>>>
>>> That ability is vastly overestimated in the man page. There is no
>>> miracle content-addressable storage system working at 500 MB/sec
>>> speeds all within a little cheap controller on SSDs. Likely most of
>>> what it can do, is just compress simple stuff, such as runs of
>>> zeroes or other repeating byte sequences.
>> Most of those that do in-line compression don't implement it in
>> firmware, they implement it in hardware, and even DEFLATE can get 500
>> MB/second speeds if properly implemented in hardware. The firmware
>> may control how the hardware works, but it's usually hardware doing
>> heavy lifting in that case, and getting a good ASIC made that can hit
>> the required performance point for a reasonable compression algorithm
>> like LZ4 or Snappy is insanely cheap once you've gotten past the VLSI
>> work.
>
> I still thinks it's a myth... The overhead of managing inline
> deduplication is just way too high to implement it without jumping
> through expensive hoops. Most workloads have almost zero deduplication
> potential. And even when, their temporal occurrence is spaced so far
> that an inline deduplicator won't catch it.
Just like the proposed implementation in BTRFS, it's not complete
deduplication. In fact, the only devices I've ever seen that do this
appear to implement it just like what was proposed for BTRFS, just with
a much smaller cache. They were also insanely expensive.
>
> If it would be all so easy, btrfs would already have it working in
> mainline. I don't even remember that those patches is still being
> worked on.
>
> With this in mind, I think dup metadata is still a good think to have
> even on SSD and I would always force to enable it.
Agreed.
>
> Potential for deduplication is only when using snapshots (which already
> are deduplicated when taken) or when handling user data on a file
> server in a multi-user environment. Users tend to copy their files all
> over the place - multiple directories of multiple gigabytes. Potential
> is also where you're working with client machine backups or vm images.
> I regularly see deduplication efficiency of 30-60% in such scenarios -
> file servers mostly which I'm handling. But due to temporally far
> spaced occurrence of duplicate blocks, only offline or nearline
> deduplication works here.
>
>>> And the DUP mode is still useful on SSDs, for cases when one copy
>>> of the DUP gets corrupted in-flight due to a bad controller or RAM
>>> or cable, you could then restore that block from its good-CRC DUP
>>> copy.
>> The only window of time during which bad RAM could result in only one
>> copy of a block being bad is after the first copy is written but
>> before the second is, which is usually an insanely small amount of
>> time. As far as the cabling, the window for errors resulting in a
>> single bad copy of a block is pretty much the same as for RAM, and if
>> they're persistently bad, you're more likely to lose data for other
>> reasons.
>
> It depends on the design of the software. You're true if this memory
> block is simply a single block throughout its lifetime in RAM before
> written to storage. But if it is already handled as duplicate block in
> memory, odds are different. I hope btrfs is doing this right... ;-)
It's pretty debatable whether handling things as duplicates in RAM is
correct or not. Memory has higher error rates than most storage media,
but it's also much more reasonable to expect it to have good EDAC
mechanisms than most storage media.
>
>> That said, I do still feel that DUP mode has value on SSD's. The
>> primary arguments against it are:
>> 1. It wears out the SSD faster.
>
> I don't think this is a huge factor, even more when looking at TBW
> capabilities of modern SSDs. And prices are low enough to better swap
> early than waiting for the disaster hitting you. Instead, you can still
> use the old SSD for archival storage (but this has drawbacks, don't
> leave them without power for months or years!) or as a shock resistent
> USB mobile drive on the go.
>
>> 2. The blocks are likely to end up in the same erase block, and
>> therefore there will be no benefit.
>
> Oh, this is probably a point to really think about... Would ssd_spread
> help here?
Not really; last I knew, the ssd* mount options only affect the chunk
allocator.
>
>> The first argument is accurate, but not usually an issue for most
>> people. Average life expectancy for a decent SSD is well over 10
>> years, which is more than twice the usual life expectancy for a
>> consumer hard drive.
>
> Well, my first SSD (128 GB) was worn (according to SMART) after only 12
> months. Bigger drives wear much slower. I now have a 500 GB SSD and
> looking at SMART it projects to serve me well for the next 3-4 years
> or longer. But it will be worn out then. But I'm pretty sure I'll get a
> new drive until then - for performance and space reasons. My high usage
> pattern probably results from using the drives for bcache in write-back
> mode. Btrfs as the bcache user does it's own job (because of CoW) of
> pressing much more data through bcache than normal expectations.
FWIW, the quote I gave (which I didn't properly qualify for some
reason...) is with respect to the two Crucial MX200 SSDs I have in my
home server system, which is primarily running BOINC apps most of the
time. Some brands are of course better than others (Kingston drives,
for example, seem to have paradoxically short life spans in my
experience).
>
>> As far as the second argument against it, that one is partially
>> correct, but ignores an important factor that many people who don't
>> do hardware design (and some who do) don't often consider. The close
>> temporal proximity of the writes for each copy are likely to mean
>> they end up in the same erase block on the SSD (especially if the SSD
>> has a large write cache).
>
> Deja vu...
>
>> However, that doesn't mean that one
>> getting corrupted due to device failure is guaranteed to corrupt the
>> other. The reason for this is exactly the same reason that single
>> word errors in RAM are exponentially more common than losing a whole
>> chip or the whole memory module: The primary error source is
>> environmental noise (EMI, cosmic rays, quantum interference,
>> background radiation, etc), not system failure. In other words,
>> you're far more likely to lose a single cell (which is usually not
>> more than a single byte in the MLC flash that gets used in most
>> modern SSD's) in the erase block than the whole erase block. In that
>> event, you obviously have only got corruption in the particular
>> filesystem block that that particular cell was storing data for.
>
> Sounds reasonable...
>
>> There's also a third argument for not using DUP on SSD's however:
>> The SSD already does most of the data integrity work itself.
>
> DUP is really not for integrity but for consistency. If one copy of the
> block becomes damaged for perfectly reasonable instructions sent by the
> OS (from the drive firmware perspective), that block has perfect data
> integrity. But if it was the single copy of a metadata block, your FS
> is probably toast now. In DUP mode you still have the other copy for
> consistent filesystem structures. With this copy, the OS can now restore
> filesystem integrity (which is levels above block level integrity).
>
That's still data integrity from the filesystem and userspace's perspective.
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2017-02-08 13:13 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <CACNDjuzntG5Saq5HHNeDUmq-=28riKAerkO=CD=zAW-QofbKSg@mail.gmail.com>
2017-01-19 16:39 ` Fwd: dup vs raid1 in single disk Alejandro R. Mosteo
2017-01-19 17:06 ` Austin S. Hemmelgarn
2017-01-19 18:23 ` Roman Mamedov
2017-01-19 20:02 ` Austin S. Hemmelgarn
2017-01-21 16:00 ` Alejandro R. Mosteo
2017-02-07 22:28 ` Kai Krakow
2017-02-07 22:46 ` Hans van Kranenburg
2017-02-08 0:39 ` Dan Mons
2017-02-08 9:14 ` Alejandro R. Mosteo
2017-02-08 13:02 ` Austin S. Hemmelgarn