* Fwd: dup vs raid1 in single disk
       [not found] <CACNDjuzntG5Saq5HHNeDUmq-=28riKAerkO=CD=zAW-QofbKSg@mail.gmail.com>
@ 2017-01-19 16:39 ` Alejandro R. Mosteo
  2017-01-19 17:06   ` Austin S. Hemmelgarn
  2017-01-19 18:23   ` Roman Mamedov
  0 siblings, 2 replies; 10+ messages in thread
From: Alejandro R. Mosteo @ 2017-01-19 16:39 UTC (permalink / raw)
  To: linux-btrfs

Hello list,

I was wondering, from a data-safety point of view, whether there is any
difference between using dup and making a raid1 from two partitions on
the same disk. The idea is to have some protection against the typical
aging HDD that starts to develop bad sectors.
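
For concreteness, the two setups I have in mind would be created roughly
like this; the device names are just examples, and I haven't checked
what current btrfs-progs accepts or defaults to:

  # dup for both data and metadata on a single device
  mkfs.btrfs -d dup -m dup /dev/sdb

  # versus btrfs raid1 across two partitions of the same disk
  mkfs.btrfs -d raid1 -m raid1 /dev/sdb1 /dev/sdb2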

On a related note, I see this caveat about dup in the manpage:

"For example, a SSD drive can remap the blocks internally to a single
copy thus deduplicating them. This negates the purpose of increased
redunancy (sic) and just wastes space"

SSD failure modes are different (more of an all-or-nothing thing, I'm
told), so that caveat wouldn't apply to the use case above, but purely
out of curiosity I'd like to know whether there would be any difference
there as well.

Thanks,
Alex.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Fwd: dup vs raid1 in single disk
  2017-01-19 16:39 ` Fwd: dup vs raid1 in single disk Alejandro R. Mosteo
@ 2017-01-19 17:06   ` Austin S. Hemmelgarn
  2017-01-19 18:23   ` Roman Mamedov
  1 sibling, 0 replies; 10+ messages in thread
From: Austin S. Hemmelgarn @ 2017-01-19 17:06 UTC (permalink / raw)
  To: Alejandro R. Mosteo, linux-btrfs

On 2017-01-19 11:39, Alejandro R. Mosteo wrote:
> Hello list,
>
> I was wondering, from a point of view of data safety, if there is any
> difference between using dup or making a raid1 from two partitions in
> the same disk. This is thinking on having some protection against the
> typical aging HDD that starts to have bad sectors.
>
> On a related note, I see this caveat about dup in the manpage:
>
> "For example, a SSD drive can remap the blocks internally to a single
> copy thus deduplicating them. This negates the purpose of increased
> redunancy (sic) and just wastes space"
>
> SSDs failure modes are different (more an all or nothing thing, I'm
> told) so it wouldn't apply to the use case above, but I'm curious for
> curiosity's sake if there would be any difference too.

On a traditional HDD, there actually is a reasonable safety benefit to 
using 2 partitions in raid1 mode over using dup mode.  This is because 
most traditional HDD firmware still keeps the mapping of physical 
sectors to logical sectors mostly linear, so having separate partitions 
will (usually) mean that the two copies are not located near each other 
on physical media.  A similar but weaker version of the same effect can 
be achieved by using the 'ssd_spread' mount option, but I would not 
suggest relying on that.  This doesn't apply to hybrid drives (because 
they move stuff around however they want like SSD's), or SMR drives 
(because they rewrite large portions of the disk when one place gets 
rewritten, so physical separation of the data copies doesn't get you as 
much protection).
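
If you do go the two-partition raid1 route, it's probably worth creating
the partitions explicitly at opposite ends of the disk so the two copies
really do end up far apart; a rough sketch, with the device name and the
50% split point as placeholders:

   parted -s /dev/sdb mklabel gpt
   # first half of the disk
   parted -s /dev/sdb mkpart first 1MiB 50%
   # second half of the disk
   parted -s /dev/sdb mkpart second 50% 100%
   mkfs.btrfs -d raid1 -m raid1 /dev/sdb1 /dev/sdb2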

For most SSD's, there is no practical benefit because the FTL in the SSD 
firmware generally maps physical sectors to logical sectors in whatever 
arbitrary way it wants, which is usually not going to be linear.

As far as failure modes on an SSD, you usually see one of two things
happen: either the whole disk starts acting odd (or stops working), or
individual blocks a few MB in size (which seem to move around the disk
as they get over-written) start behaving oddly.  The first case is the
firmware or primary electronics going bad, while the second is
individual erase blocks going bad.  As a general rule, SSD's will run
longer while they're going bad than HDD's will, but in both cases you
should look at replacing the device once you start seeing the error
counters go up consistently over time (or if you see them suddenly
jump to a much higher number).
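
If you want to keep an eye on those counters, a couple of commands are
usually enough; smartctl comes from smartmontools, and the device and
mount point here are just examples:

   # drive-level error and wear counters (attribute names vary by vendor)
   smartctl -A /dev/sda

   # btrfs's own per-device counters (read/write/flush/corruption/
   # generation errors)
   btrfs device stats /mnt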

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: dup vs raid1 in single disk
  2017-01-19 16:39 ` Fwd: dup vs raid1 in single disk Alejandro R. Mosteo
  2017-01-19 17:06   ` Austin S. Hemmelgarn
@ 2017-01-19 18:23   ` Roman Mamedov
  2017-01-19 20:02     ` Austin S. Hemmelgarn
  1 sibling, 1 reply; 10+ messages in thread
From: Roman Mamedov @ 2017-01-19 18:23 UTC (permalink / raw)
  To: Alejandro R. Mosteo; +Cc: linux-btrfs

On Thu, 19 Jan 2017 17:39:37 +0100
"Alejandro R. Mosteo" <alejandro@mosteo.com> wrote:

> I was wondering, from a point of view of data safety, if there is any
> difference between using dup or making a raid1 from two partitions in
> the same disk. This is thinking on having some protection against the
> typical aging HDD that starts to have bad sectors.

RAID1 will write slower than DUP, as any optimization to make RAID1
devices work in parallel turns into a total performance disaster here: you
end up trying to write to both partitions at the same time, turning all
linear writes into random ones, which are about two orders of magnitude
slower than linear writes on spinning hard drives. DUP shouldn't have this
issue, but it will still be twice as slow as single, since you are writing
everything twice.

You could consider DUP data for a disk that is already known to be getting
bad sectors from time to time -- but then it's a fringe exercise to keep
using such a disk in the first place. With DUP data and DUP metadata you
can likely get some more life out of such a disk as throwaway storage space
for non-essential data, at half capacity, but is it worth the effort, given
that it's likely to keep failing progressively worse over time?

In all other cases the performance and storage space penalty of DUP within
a single device is way too great (and the gained redundancy too low)
compared to a proper setup of single-profile data + backups, or a RAID5/6
system (not Btrfs-based) + backups.
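
By a RAID5/6 system not based on Btrfs I mean, for example, btrfs with
single profiles on top of an md array; a rough sketch, with device names
as placeholders:

  # software RAID5 across three whole disks
  mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd
  # btrfs on top keeps its checksums, but redundancy is handled by md
  mkfs.btrfs /dev/md0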

> On a related note, I see this caveat about dup in the manpage:
> 
> "For example, a SSD drive can remap the blocks internally to a single
> copy thus deduplicating them. This negates the purpose of increased
> redunancy (sic) and just wastes space"

That ability is vastly overestimated in the man page. There is no miracle
content-addressable storage system working at 500 MB/sec all within a
cheap little controller on SSDs. Most of what it can likely do is compress
simple stuff, such as runs of zeroes or other repeating byte sequences.

And DUP mode is still useful on SSDs: for cases where one copy of the DUP
gets corrupted in flight due to a bad controller, RAM or cable, you can
then restore that block from its good-CRC DUP copy.
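
That repair normally happens automatically when the bad copy is read, but
you can also force a check of all copies with a scrub; the mount point is
just an example:

  btrfs scrub start /mnt    # rewrites bad copies from the good-CRC ones
  btrfs scrub status /mnt   # progress and error counts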

-- 
With respect,
Roman

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: dup vs raid1 in single disk
  2017-01-19 18:23   ` Roman Mamedov
@ 2017-01-19 20:02     ` Austin S. Hemmelgarn
  2017-01-21 16:00       ` Alejandro R. Mosteo
  2017-02-07 22:28       ` Kai Krakow
  0 siblings, 2 replies; 10+ messages in thread
From: Austin S. Hemmelgarn @ 2017-01-19 20:02 UTC (permalink / raw)
  To: Roman Mamedov, Alejandro R. Mosteo; +Cc: linux-btrfs

On 2017-01-19 13:23, Roman Mamedov wrote:
> On Thu, 19 Jan 2017 17:39:37 +0100
> "Alejandro R. Mosteo" <alejandro@mosteo.com> wrote:
>
>> I was wondering, from a point of view of data safety, if there is any
>> difference between using dup or making a raid1 from two partitions in
>> the same disk. This is thinking on having some protection against the
>> typical aging HDD that starts to have bad sectors.
>
> RAID1 will write slower compared to DUP, as any optimization to make RAID1
> devices work in parallel will cause a total performance disaster for you as
> you will start trying to write to both partitions at the same time, turning
> all linear writes into random ones, which are about two orders of magnitude
> slower than linear on spinning hard drives. DUP shouldn't have this issue, but
> still it will be twice slower than single, since you are writing everything
> twice.
As of right now, there will actually be near zero impact on write 
performance (or at least, it's way less than the theoretical 50%) 
because there really isn't any optimization to speak of in the 
multi-device code.  That will hopefully change over time, but it's not 
likely to do so any time soon, since nobody appears to be working on 
multi-device write performance.
>
> You could consider DUP data for when a disk is already known to be getting bad
> sectors from time to time -- but then it's a fringe exercise to try and keep
> using such disk in the first place. Yeah with DUP data DUP metadata you can
> likely have some more life out of such disk as a throwaway storage space for
> non-essential data, at half capacity, but is it worth the effort, as it's
> likely to start failing progressively worse over time.
>
> In all other cases the performance and storage space penalty of DUP within a
> single device are way too great (and gained redundancy is too low) compared
> to a proper system of single profile data + backups, or a RAID5/6 system (not
> Btrfs-based) + backups.
That really depends on your usage.  In my case, I run DUP data on single 
disks regularly.  I still do backups of course, but the raw performance 
matters far less to me (especially in the cases where I'm using NVMe 
SSD's, which have performance measured in thousands of MB/s for both 
reads and writes) than the ability to recover from transient data 
corruption without needing to go to a backup.

As long as /home and any other write-heavy directories are on a separate 
partition, I would actually advocate using DUP data on your root 
filesystem if you can afford the space, simply because it's a whole lot 
easier to recover other data if the root filesystem still works.  Most 
of the root filesystem except some stuff under /var follows a WORM 
access pattern, and even the stuff that doesn't in /var is usually not 
performance-critical, so the write performance penalty won't have 
anywhere near as much impact on how well the system runs as you might think.
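
If you want to try that on an existing root filesystem, the conversion 
can be done online with a balance; a rough sketch, and converting data to 
dup needs a reasonably recent kernel and btrfs-progs:

   # check the current data/metadata profiles first
   btrfs filesystem df /
   # convert both data and metadata to dup
   btrfs balance start -dconvert=dup -mconvert=dup /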

There's also the fact that you're writing more metadata than data most 
of the time unless you're dealing with really big files, and metadata is 
already in DUP mode (unless you are using an SSD), so the performance hit 
isn't 50%; it's actually a bit more than half the ratio of data writes 
to metadata writes.
>
>> On a related note, I see this caveat about dup in the manpage:
>>
>> "For example, a SSD drive can remap the blocks internally to a single
>> copy thus deduplicating them. This negates the purpose of increased
>> redunancy (sic) and just wastes space"
>
> That ability is vastly overestimated in the man page. There is no miracle
> content-addressable storage system working at 500 MB/sec speeds all within a
> little cheap controller on SSDs. Likely most of what it can do, is just
> compress simple stuff, such as runs of zeroes or other repeating byte
> sequences.
Most of those that do in-line compression don't implement it in 
firmware; they implement it in hardware, and even DEFLATE can reach 500 
MB/second if properly implemented in hardware.  The firmware may 
control how the hardware works, but it's usually the hardware doing the 
heavy lifting in that case, and getting a good ASIC made that can hit the 
required performance point for a reasonable compression algorithm like 
LZ4 or Snappy is insanely cheap once you've gotten past the VLSI work.
>
> And the DUP mode is still useful on SSDs, for cases when one copy of the DUP
> gets corrupted in-flight due to a bad controller or RAM or cable, you could
> then restore that block from its good-CRC DUP copy.
The only window of time during which bad RAM could result in only one 
copy of a block being bad is after the first copy is written but before 
the second is, which is usually an insanely small amount of time.  As 
far as the cabling, the window for errors resulting in a single bad copy 
of a block is pretty much the same as for RAM, and if they're 
persistently bad, you're more likely to lose data for other reasons.

That said, I do still feel that DUP mode has value on SSD's.  The 
primary arguments against it are:
1. It wears out the SSD faster.
2. The blocks are likely to end up in the same erase block, and 
therefore there will be no benefit.

The first argument is accurate, but not usually an issue for most 
people.  Average life expectancy for a decent SSD is well over 10 years, 
which is more than twice the usual life expectancy for a consumer hard 
drive.  Putting it in further perspective, the 575GB SSD's have been 
running essentially 24/7 for the past year and a half (13112 hours 
powered on now), and have seen just short of 25.7TB of writes over that 
time.  This equates to roughly 2GB/hour, which is well within typical 
desktop usage.  It also means they've seen more than 44.5 times their 
total capacity in writes.  Despite this, the wear-out indicators all 
show that I can still expect at least 9 years more of run-time on these. 
Normalizing that, I'm likely to see between 8 and 12 years of life on 
these.  Equivalent stats for the HDD's I used to use (NAS-rated Seagate 
drives) gave me a roughly 3-5 year life expectancy, less than half that 
of the SSD.  In both cases, however, you're talking well beyond the 
typical life expectancy of anything short of a server or a tightly 
embedded system, and worrying about a 4-year versus 8-year life 
expectancy on your storage device is kind of pointless when you need to 
upgrade the rest of the system in 3 years.
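
All of those numbers come straight out of SMART, if you want to run the 
same arithmetic on your own drives; attribute names differ between 
vendors, so the grep pattern below is only an example:

   # power-on hours, total writes, and wear/life-remaining attributes
   smartctl -A /dev/sda | grep -Ei 'power_on_hours|lbas_written|wear|life'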

As far as the second argument against it, that one is partially correct, 
but ignores an important factor that many people who don't do hardware 
design (and some who do) don't often consider.  The close temporal 
proximity of the writes for each copy is likely to mean they end up in 
the same erase block on the SSD (especially if the SSD has a large write 
cache).  However, that doesn't mean that one getting corrupted due to 
device failure is guaranteed to corrupt the other.  The reason for this 
is exactly the same reason that single word errors in RAM are 
exponentially more common than losing a whole chip or the whole memory 
module: The primary error source is environmental noise (EMI, cosmic 
rays, quantum interference, background radiation, etc), not system 
failure.  In other words, you're far more likely to lose a single cell 
(which is usually not more than a single byte in the MLC flash that gets 
used in most modern SSD's) in the erase block than the whole erase 
block.  In that event, you obviously have only got corruption in the 
particular filesystem block that that particular cell was storing data for.

There's also a third argument for not using DUP on SSD's however:
The SSD already does most of the data integrity work itself.
This is only true of good SSD's, but many do have some degree of 
built-in erasure coding in the firmware which can handle losing large 
chunks of an erase block and still return the data safely.  This is part 
of the reason that you almost never see nice power-of-two sizes for 
flash storage despite flash chips being made that way themselves (the 
other part is the spare blocks).  Depending on the degree of protection 
provided by this erasure coding, it can actually cancel out my argument 
against argument 2.  In all practicality though, that requires you to 
actually trust the SSD manufacturer to have implemented things properly 
for it to be a valid counter-argument, and most people who would care 
enough about data integrity to use BTRFS for that reason are not likely 
to trust the storage device that much.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: dup vs raid1 in single disk
  2017-01-19 20:02     ` Austin S. Hemmelgarn
@ 2017-01-21 16:00       ` Alejandro R. Mosteo
  2017-02-07 22:28       ` Kai Krakow
  1 sibling, 0 replies; 10+ messages in thread
From: Alejandro R. Mosteo @ 2017-01-21 16:00 UTC (permalink / raw)
  Cc: linux-btrfs

Thanks Austin and Roman for the interesting discussion.

Alex.

On 19/01/17 21:02, Austin S. Hemmelgarn wrote:
> On 2017-01-19 13:23, Roman Mamedov wrote:
>> On Thu, 19 Jan 2017 17:39:37 +0100
>> "Alejandro R. Mosteo" <alejandro@mosteo.com> wrote:
>>
>>> I was wondering, from a point of view of data safety, if there is any
>>> difference between using dup or making a raid1 from two partitions in
>>> the same disk. This is thinking on having some protection against the
>>> typical aging HDD that starts to have bad sectors.
>>
>> RAID1 will write slower compared to DUP, as any optimization to make 
>> RAID1
>> devices work in parallel will cause a total performance disaster for 
>> you as
>> you will start trying to write to both partitions at the same time, 
>> turning
>> all linear writes into random ones, which are about two orders of 
>> magnitude
>> slower than linear on spinning hard drives. DUP shouldn't have this 
>> issue, but
>> still it will be twice slower than single, since you are writing 
>> everything
>> twice.
> As of right now, there will actually be near zero impact on write 
> performance (or at least, it's way less than the theoretical 50%) 
> because there really isn't any optimization to speak of in the 
> multi-device code.  That will hopefully change over time, but it's not 
> likely to do so any time in the future since nobody appears to be 
> working on multi-device write performance.
>>
>> You could consider DUP data for when a disk is already known to be 
>> getting bad
>> sectors from time to time -- but then it's a fringe exercise to try 
>> and keep
>> using such disk in the first place. Yeah with DUP data DUP metadata 
>> you can
>> likely have some more life out of such disk as a throwaway storage 
>> space for
>> non-essential data, at half capacity, but is it worth the effort, as 
>> it's
>> likely to start failing progressively worse over time.
>>
>> In all other cases the performance and storage space penalty of DUP 
>> within a
>> single device are way too great (and gained redundancy is too low) 
>> compared
>> to a proper system of single profile data + backups, or a RAID5/6 
>> system (not
>> Btrfs-based) + backups.
> That really depends on your usage.  In my case, I run DUP data on 
> single disks regularly.  I still do backups of course, but the 
> performance is worth far less for me (especially in the cases where 
> I'm using NVMe SSD's which have performance measured in thousands of 
> MB/s for both reads and writes) than the ability to recover from 
> transient data corruption without needing to go to a backup.
>
> As long as /home and any other write heavy directories are on a 
> separate partition, I would actually advocate using DUP data on your 
> root filesystem if you can afford the space simply because it's a 
> whole lot easier to recover other data if the root filesystem still 
> works.  Most of the root filesystem except some stuff under /var 
> follows a WORM access pattern, and even the stuff that doesn't in /var 
> is usually not performance critical, so the write performance penalty 
> won't have anywhere near as much impact on how well the system runs as 
> you might think.
>
> There's also the fact that you're writing more metadata than data most 
> of the time unless you're dealing with really big files, and metadata 
> is already DUP mode (unless you are using an SSD), so the performance 
> hit isn't 50%, it's actually a bit more than half the ratio of data 
> writes to metadata writes.
>>
>>> On a related note, I see this caveat about dup in the manpage:
>>>
>>> "For example, a SSD drive can remap the blocks internally to a single
>>> copy thus deduplicating them. This negates the purpose of increased
>>> redunancy (sic) and just wastes space"
>>
>> That ability is vastly overestimated in the man page. There is no 
>> miracle
>> content-addressable storage system working at 500 MB/sec speeds all 
>> within a
>> little cheap controller on SSDs. Likely most of what it can do, is just
>> compress simple stuff, such as runs of zeroes or other repeating byte
>> sequences.
> Most of those that do in-line compression don't implement it in 
> firmware, they implement it in hardware, and even DEFLATE can get 500 
> MB/second speeds if properly implemented in hardware.  The firmware 
> may control how the hardware works, but it's usually hardware doing 
> heavy lifting in that case, and getting a good ASIC made that can hit 
> the required performance point for a reasonable compression algorithm 
> like LZ4 or Snappy is insanely cheap once you've gotten past the VLSI 
> work.
>>
>> And the DUP mode is still useful on SSDs, for cases when one copy of 
>> the DUP
>> gets corrupted in-flight due to a bad controller or RAM or cable, you 
>> could
>> then restore that block from its good-CRC DUP copy.
> The only window of time during which bad RAM could result in only one 
> copy of a block being bad is after the first copy is written but 
> before the second is, which is usually an insanely small amount of 
> time.  As far as the cabling, the window for errors resulting in a 
> single bad copy of a block is pretty much the same as for RAM, and if 
> they're persistently bad, you're more likely to lose data for other 
> reasons.
>
> That said, I do still feel that DUP mode has value on SSD's.  The 
> primary arguments against it are:
> 1. It wears out the SSD faster.
> 2. The blocks are likely to end up in the same erase block, and 
> therefore there will be no benefit.
>
> The first argument is accurate, but not usually an issue for most 
> people.  Average life expectancy for a decent SSD is well over 10 
> years, which is more than twice the usual life expectancy for a 
> consumer hard drive.  Putting it in further perspective, the 575GB 
> SSD's have been running essentially 24/7 for the past year and a half 
> (13112 hours powered on now), and have seen just short of 25.7TB of 
> writes over that time.  This equates to roughly 2GB/hour, which is 
> well within typical desktop usage.  It also means they've seen more 
> than 44.5 times their total capacity in writes.  Despite this, the 
> wear-out indicators all show that I can still expect at least 9 years 
> more of run-time on these.  Normalizing that, that means I'm likely to 
> see between 8 and 12 years of life on these.  Equivalent stats for the 
> HDD's I used to use (NAS rated Seagate drives) gave me a roughly 3-5 
> year life expectancy, less than half that of the SSD.  In both cases 
> however, you're talking well beyond the typical life expectancy of 
> anything short of a server or a tight-embedded system, and worrying 
> about a 4-year versus 8-year life expectancy on your storage device is 
> kind of pointless when you need to upgrade the rest of the system in 3 
> years.
>
> As far as the second argument against it, that one is partially 
> correct, but ignores an important factor that many people who don't do 
> hardware design (and some who do) don't often consider. The close 
> temporal proximity of the writes for each copy are likely to mean they 
> end up in the same erase block on the SSD (especially if the SSD has a 
> large write cache).  However, that doesn't mean that one getting 
> corrupted due to device failure is guaranteed to corrupt the other.  
> The reason for this is exactly the same reason that single word errors 
> in RAM are exponentially more common than losing a whole chip or the 
> whole memory module: The primary error source is environmental noise 
> (EMI, cosmic rays, quantum interference, background radiation, etc), 
> not system failure.  In other words, you're far more likely to lose a 
> single cell (which is usually not more than a single byte in the MLC 
> flash that gets used in most modern SSD's) in the erase block than the 
> whole erase block.  In that event, you obviously have only got 
> corruption in the particular filesystem block that that particular 
> cell was storing data for.
>
> There's also a third argument for not using DUP on SSD's however:
> The SSD already does most of the data integrity work itself.
> This is only true of good SSD's, but many do have some degree of 
> built-in erasure coding in the firmware which can handle losing large 
> chunks of an erase block and still return the data safely. This is 
> part of the reason that you almost never see nice power-of-two sizes 
> for flash Storage despite flash chips being made that way them,selves 
> (the other part is the spare blocks). Depending on the degree of 
> protection provided by this erasure coding, it can actually cancel out 
> my argument against argument 2.  In all practicality though, that 
> requires you to actually trust the SSD manufacturer to have 
> implemented things properly for it to be a valid counter-argument, and 
> most people who would care enough about data integrity to use BTRFS 
> for that reason are not likely to trust the storage device that much.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: dup vs raid1 in single disk
  2017-01-19 20:02     ` Austin S. Hemmelgarn
  2017-01-21 16:00       ` Alejandro R. Mosteo
@ 2017-02-07 22:28       ` Kai Krakow
  2017-02-07 22:46         ` Hans van Kranenburg
                           ` (3 more replies)
  1 sibling, 4 replies; 10+ messages in thread
From: Kai Krakow @ 2017-02-07 22:28 UTC (permalink / raw)
  To: linux-btrfs

On Thu, 19 Jan 2017 15:02:14 -0500,
"Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote:

> On 2017-01-19 13:23, Roman Mamedov wrote:
> > On Thu, 19 Jan 2017 17:39:37 +0100
> > "Alejandro R. Mosteo" <alejandro@mosteo.com> wrote:
> >  
> >> I was wondering, from a point of view of data safety, if there is
> >> any difference between using dup or making a raid1 from two
> >> partitions in the same disk. This is thinking on having some
> >> protection against the typical aging HDD that starts to have bad
> >> sectors.  
> >
> > RAID1 will write slower compared to DUP, as any optimization to
> > make RAID1 devices work in parallel will cause a total performance
> > disaster for you as you will start trying to write to both
> > partitions at the same time, turning all linear writes into random
> > ones, which are about two orders of magnitude slower than linear on
> > spinning hard drives. DUP shouldn't have this issue, but still it
> > will be twice slower than single, since you are writing everything
> > twice.  
> As of right now, there will actually be near zero impact on write 
> performance (or at least, it's way less than the theoretical 50%) 
> because there really isn't any optimization to speak of in the 
> multi-device code.  That will hopefully change over time, but it's
> not likely to do so any time in the future since nobody appears to be 
> working on multi-device write performance.

I think that's only true if you don't account for the seek overhead. In
single-device RAID1 mode you will always seek across half of the device
while writing data, and even when reading, since the copy to read is
chosen by odd/even PID. In contrast, DUP mode doesn't guarantee that your
seeks are shorter, but statistically, on average they should be. So it
should yield better performance (though I wouldn't expect it to be
observable, depending on your workload).

So, on devices with no seek overhead (i.e. SSDs), it is probably true
(minus bus bandwidth considerations). For HDDs I'd prefer DUP.
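
If someone wants to measure this instead of arguing from theory, a plain
sequential-write run against each layout should already show the
difference; a rough fio sketch, where /mnt/test is whichever layout is
currently mounted:

  # sequential 1MiB writes, syncing at the end so the result isn't just
  # the page cache
  fio --name=seqwrite --directory=/mnt/test --rw=write --bs=1M \
      --size=4G --end_fsync=1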

From a data safety point of view: it's more likely that adjacent and
nearby sectors go bad together. So DUP imposes a higher risk of data
being written only to bad sectors - which means data loss or even
filesystem loss (if metadata hits this problem).

To be realistic: I wouldn't trade space for duplicate data on an already
failing disk, no matter if it's DUP or RAID1. HDD space is cheap, and such
a scenario is just a waste of performance AND space - no matter what. I
don't understand the purpose of this. It just results in a false sense of
safety.

Better to get two separate devices at half the size. There's a good
chance of getting a better cost/space ratio anyway, plus better
performance and safety.

> There's also the fact that you're writing more metadata than data
> most of the time unless you're dealing with really big files, and
> metadata is already DUP mode (unless you are using an SSD), so the
> performance hit isn't 50%, it's actually a bit more than half the
> ratio of data writes to metadata writes.
> >  
> >> On a related note, I see this caveat about dup in the manpage:
> >>
> >> "For example, a SSD drive can remap the blocks internally to a
> >> single copy thus deduplicating them. This negates the purpose of
> >> increased redunancy (sic) and just wastes space"  
> >
> > That ability is vastly overestimated in the man page. There is no
> > miracle content-addressable storage system working at 500 MB/sec
> > speeds all within a little cheap controller on SSDs. Likely most of
> > what it can do, is just compress simple stuff, such as runs of
> > zeroes or other repeating byte sequences.  
> Most of those that do in-line compression don't implement it in 
> firmware, they implement it in hardware, and even DEFLATE can get 500 
> MB/second speeds if properly implemented in hardware.  The firmware
> may control how the hardware works, but it's usually hardware doing
> heavy lifting in that case, and getting a good ASIC made that can hit
> the required performance point for a reasonable compression algorithm
> like LZ4 or Snappy is insanely cheap once you've gotten past the VLSI
> work.

I still think it's a myth... The overhead of managing inline
deduplication is just way too high to implement it without jumping
through expensive hoops. Most workloads have almost zero deduplication
potential, and even when they do, the duplicate blocks are spaced so far
apart in time that an inline deduplicator won't catch them.

If it were all that easy, btrfs would already have it working in
mainline. I don't even know whether those patches are still being
worked on.

With this in mind, I think dup metadata is still a good thing to have
even on an SSD, and I would always force-enable it.
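
By force-enabling I mean overriding the default at mkfs time, or
converting later; a sketch, with the device and mount point as
placeholders (last time I checked, mkfs defaults metadata to single on
non-rotational devices):

  # ask for dup metadata explicitly on an SSD
  mkfs.btrfs -m dup /dev/nvme0n1

  # or convert an existing filesystem
  btrfs balance start -mconvert=dup /mnt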

There's only real potential for deduplication when using snapshots
(which are already deduplicated when taken) or when handling user data
on a file server in a multi-user environment. Users tend to copy their
files all over the place - multiple directories of multiple gigabytes.
There's also potential when you're working with client machine backups
or VM images. I regularly see deduplication efficiency of 30-60% in such
scenarios - mostly on the file servers I'm handling. But because the
duplicate blocks appear so far apart in time, only offline or nearline
deduplication works here.

> > And the DUP mode is still useful on SSDs, for cases when one copy
> > of the DUP gets corrupted in-flight due to a bad controller or RAM
> > or cable, you could then restore that block from its good-CRC DUP
> > copy.  
> The only window of time during which bad RAM could result in only one 
> copy of a block being bad is after the first copy is written but
> before the second is, which is usually an insanely small amount of
> time.  As far as the cabling, the window for errors resulting in a
> single bad copy of a block is pretty much the same as for RAM, and if
> they're persistently bad, you're more likely to lose data for other
> reasons.

It depends on the design of the software. That's true if this memory
block is simply a single block throughout its lifetime in RAM before
being written to storage. But if it is already handled as a duplicate
block in memory, the odds are different. I hope btrfs is doing this
right... ;-)

> That said, I do still feel that DUP mode has value on SSD's.  The 
> primary arguments against it are:
> 1. It wears out the SSD faster.

I don't think this is a huge factor, especially when looking at the TBW
ratings of modern SSDs. And prices are low enough that it's better to
swap early than to wait for disaster to hit you. You can then still use
the old SSD for archival storage (but this has drawbacks - don't leave
them without power for months or years!) or as a shock-resistant USB
mobile drive on the go.

> 2. The blocks are likely to end up in the same erase block, and 
> therefore there will be no benefit.

Oh, this is probably a point to really think about... Would ssd_spread
help here?

> The first argument is accurate, but not usually an issue for most 
> people.  Average life expectancy for a decent SSD is well over 10
> years, which is more than twice the usual life expectancy for a
> consumer hard drive.

Well, my first SSD (128 GB) was worn out (according to SMART) after only
12 months. Bigger drives wear much more slowly. I now have a 500 GB SSD,
and according to SMART it projects to serve me well for the next 3-4
years or longer - but it will be worn out by then. I'm pretty sure I'll
get a new drive before then anyway, for performance and space reasons.
My high usage pattern probably results from using the drives for bcache
in write-back mode. Btrfs as the bcache user does its own part (because
of CoW) in pushing much more data through bcache than you would normally
expect.

> As far as the second argument against it, that one is partially
> correct, but ignores an important factor that many people who don't
> do hardware design (and some who do) don't often consider.  The close
> temporal proximity of the writes for each copy are likely to mean
> they end up in the same erase block on the SSD (especially if the SSD
> has a large write cache).

Deja vu...

>  However, that doesn't mean that one
> getting corrupted due to device failure is guaranteed to corrupt the
> other.  The reason for this is exactly the same reason that single
> word errors in RAM are exponentially more common than losing a whole
> chip or the whole memory module: The primary error source is
> environmental noise (EMI, cosmic rays, quantum interference,
> background radiation, etc), not system failure.  In other words,
> you're far more likely to lose a single cell (which is usually not
> more than a single byte in the MLC flash that gets used in most
> modern SSD's) in the erase block than the whole erase block.  In that
> event, you obviously have only got corruption in the particular
> filesystem block that that particular cell was storing data for.

Sounds reasonable...

> There's also a third argument for not using DUP on SSD's however:
> The SSD already does most of the data integrity work itself.

DUP is really not for integrity but for consistency. If one copy of the
block becomes damaged through perfectly reasonable instructions sent by
the OS, then from the drive firmware's perspective that block still has
perfect data integrity. But if it was the single copy of a metadata
block, your FS is probably toast now. In DUP mode you still have the
other copy of the consistent filesystem structures. With this copy, the
OS can now restore filesystem integrity (which sits levels above
block-level integrity).


-- 
Regards,
Kai

Replies to list-only preferred.



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: dup vs raid1 in single disk
  2017-02-07 22:28       ` Kai Krakow
@ 2017-02-07 22:46         ` Hans van Kranenburg
  2017-02-08  0:39         ` Dan Mons
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 10+ messages in thread
From: Hans van Kranenburg @ 2017-02-07 22:46 UTC (permalink / raw)
  To: Kai Krakow, linux-btrfs

On 02/07/2017 11:28 PM, Kai Krakow wrote:
> On Thu, 19 Jan 2017 15:02:14 -0500,
> "Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote:
> 
>> On 2017-01-19 13:23, Roman Mamedov wrote:
>>> On Thu, 19 Jan 2017 17:39:37 +0100
>>> [...]
>>> And the DUP mode is still useful on SSDs, for cases when one copy
>>> of the DUP gets corrupted in-flight due to a bad controller or RAM
>>> or cable, you could then restore that block from its good-CRC DUP
>>> copy.  
>> The only window of time during which bad RAM could result in only one 
>> copy of a block being bad is after the first copy is written but
>> before the second is, which is usually an insanely small amount of
>> time.  As far as the cabling, the window for errors resulting in a
>> single bad copy of a block is pretty much the same as for RAM, and if
>> they're persistently bad, you're more likely to lose data for other
>> reasons.
> 
> It depends on the design of the software. You're true if this memory
> block is simply a single block throughout its lifetime in RAM before
> written to storage. But if it is already handled as duplicate block in
> memory, odds are different. I hope btrfs is doing this right... ;-)

In memory, it's just one copy, happily sitting around, getting corrupted
by cosmic rays and other stuff done to it by aliens, after which a valid
checksum is calculated for the corrupt data, after which it goes on its
way to disk, twice. Yay.

>> That said, I do still feel that DUP mode has value on SSD's.  The 
>> primary arguments against it are:
>> 1. It wears out the SSD faster.
> 
> I don't think this is a huge factor, even more when looking at TBW
> capabilities of modern SSDs. And prices are low enough to better swap
> early than waiting for the disaster hitting you. Instead, you can still
> use the old SSD for archival storage (but this has drawbacks, don't
> leave them without power for months or years!) or as a shock resistent
> USB mobile drive on the go.
> 
>> 2. The blocks are likely to end up in the same erase block, and 
>> therefore there will be no benefit.
> 
> Oh, this is probably a point to really think about... Would ssd_spread
> help here?

I think there was another one: SSD firmware deduplicating writes,
converting the DUP into single again, giving a false sense of it being DUP.

This is one that can be solved by e.g. using disk encryption, which
causes identical writes to show up as different data on disk.
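
Roughly, that means putting btrfs on top of dm-crypt; the device and
mapping names below are just examples:

  # encrypt the device, open it, then create btrfs on the mapping so
  # each DUP copy reaches the flash as unique ciphertext
  cryptsetup luksFormat /dev/sdb
  cryptsetup open /dev/sdb cryptdisk
  mkfs.btrfs -d dup -m dup /dev/mapper/cryptdisk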

-- 
Hans van Kranenburg

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: dup vs raid1 in single disk
  2017-02-07 22:28       ` Kai Krakow
  2017-02-07 22:46         ` Hans van Kranenburg
@ 2017-02-08  0:39         ` Dan Mons
  2017-02-08  9:14         ` Alejandro R. Mosteo
  2017-02-08 13:02         ` Austin S. Hemmelgarn
  3 siblings, 0 replies; 10+ messages in thread
From: Dan Mons @ 2017-02-08  0:39 UTC (permalink / raw)
  To: Kai Krakow; +Cc: linux-btrfs

On 8 February 2017 at 08:28, Kai Krakow <hurikhan77@gmail.com> wrote:
> I still thinks it's a myth... The overhead of managing inline
> deduplication is just way too high to implement it without jumping
> through expensive hoops. Most workloads have almost zero deduplication
> potential. And even when, their temporal occurrence is spaced so far
> that an inline deduplicator won't catch it.
>
> If it would be all so easy, btrfs would already have it working in
> mainline. I don't even remember that those patches is still being
> worked on.
>
> With this in mind, I think dup metadata is still a good think to have
> even on SSD and I would always force to enable it.
>
> Potential for deduplication is only when using snapshots (which already
> are deduplicated when taken) or when handling user data on a file
> server in a multi-user environment. Users tend to copy their files all
> over the place - multiple directories of multiple gigabytes. Potential
> is also where you're working with client machine backups or vm images.
> I regularly see deduplication efficiency of 30-60% in such scenarios -
> file servers mostly which I'm handling. But due to temporally far
> spaced occurrence of duplicate blocks, only offline or nearline
> deduplication works here.

I'm a sysadmin by trade, managing many PB of storage for a media
company.  Our primary storage consists of Oracle ZFS appliances, and all of
our secondary/nearline storage is Linux+BtrFS.

ZFS's inline deduplication is awful.  It consumes enormous amounts of
RAM that is orders of magnitude more valuable as ARC/Cache, and
becomes immediately useless whenever a storage node is rebooted
(necessary to apply mandatory security patches) and the in-memory
tables are lost (meaning cold data is rarely re-examined, and the
inline dedup becomes less efficient).

Conversely, I use "duperemove" as a one-shot/offline deduplication
tool on all of our BtrFS storage.  It can be set up as a cron job to
run outside of business hours, and it uses an SQLite database to store
the necessary dedup hash information on disk rather than in RAM.
From the point of view of someone who manages large amounts of
long-term centralised storage, this is a far superior way to deal with
deduplication, as it offers more flexibility and far better
space-saving ratios at a lower memory cost.
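
The invocation is roughly along these lines; the schedule, hashfile
location and paths are examples rather than our exact setup:

  # /etc/cron.d/duperemove - weekly, outside business hours
  0 2 * * 0  root  duperemove -dr --hashfile=/var/lib/duperemove.hash /srv/storage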

We trialled ZFS dedup for a few months, and decided to turn it off, as
there was far less benefit to ZFS using all that RAM for dedup than
there was for it to be cache.  I've been requesting Oracle offer a
similar offline dedup tool for their ZFS appliance for a very long
time, and if BtrFS ever did offer inline dedup, I wouldn't bother
using it for all of the reasons above.

-Dan

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: dup vs raid1 in single disk
  2017-02-07 22:28       ` Kai Krakow
  2017-02-07 22:46         ` Hans van Kranenburg
  2017-02-08  0:39         ` Dan Mons
@ 2017-02-08  9:14         ` Alejandro R. Mosteo
  2017-02-08 13:02         ` Austin S. Hemmelgarn
  3 siblings, 0 replies; 10+ messages in thread
From: Alejandro R. Mosteo @ 2017-02-08  9:14 UTC (permalink / raw)
  To: Kai Krakow, linux-btrfs

On 07/02/17 23:28, Kai Krakow wrote:
> To be realistic: I wouldn't trade space usage for duplicate data on an
> already failing disk, no matter if it's DUP or RAID1. HDD disk space is
> cheap, and using such a scenario is just waste of performance AND
> space - no matter what. I don't understand the purpose of this. It just
> results in fake safety.
The disk has already been replaced and is no longer my workstation's 
main drive. I work with large datasets in my research, and I don't care 
much about sustained I/O efficiency, since they're only read when 
needed. Hence, it's a matter of squeezing the last bit of life out of 
that disk instead of discarding it right away. This way I get one extra 
piece of local storage that may spare me a copy from a remote machine, 
so I prefer to play with it until it dies. Besides, it affords me a 
chance to play with btrfs/zfs in ways that I wouldn't normally risk, and 
I can also assess their behavior with a truly failing disk.

In the end, after a destructive write pass with badblocks, the disk's 
increasing uncorrectable sectors have disappeared... go figure. So right 
now I have a btrfs filesystem built with the single profile on top of 
four differently sized partitions. When/if bad blocks reappear I'll test 
some raid configuration; probably raidz, unless btrfs raid5 is somewhat 
usable by then (why go with half a disk's worth when you can have 2/3? ;-))
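
For reference, the kind of commands involved, with the device name as an 
example (the badblocks pass is destructive, so only for a disk that's 
already written off):

   # destructive read-write surface test; forces weak sectors to be
   # reallocated by the drive
   badblocks -wsv /dev/sdb

   # single data profile across the four partitions (metadata left at
   # the multi-device default)
   mkfs.btrfs -d single /dev/sdb1 /dev/sdb2 /dev/sdb3 /dev/sdb4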

Thanks for your justified concern though.

Alex.

> Better get two separate devices half the size. There's a better chance
> of getting a better cost/space ratio anyways, plus better performance
> and safety.
>
>> There's also the fact that you're writing more metadata than data
>> most of the time unless you're dealing with really big files, and
>> metadata is already DUP mode (unless you are using an SSD), so the
>> performance hit isn't 50%, it's actually a bit more than half the
>> ratio of data writes to metadata writes.
>>>   
>>>> On a related note, I see this caveat about dup in the manpage:
>>>>
>>>> "For example, a SSD drive can remap the blocks internally to a
>>>> single copy thus deduplicating them. This negates the purpose of
>>>> increased redunancy (sic) and just wastes space"
>>> That ability is vastly overestimated in the man page. There is no
>>> miracle content-addressable storage system working at 500 MB/sec
>>> speeds all within a little cheap controller on SSDs. Likely most of
>>> what it can do, is just compress simple stuff, such as runs of
>>> zeroes or other repeating byte sequences.
>> Most of those that do in-line compression don't implement it in
>> firmware, they implement it in hardware, and even DEFLATE can get 500
>> MB/second speeds if properly implemented in hardware.  The firmware
>> may control how the hardware works, but it's usually hardware doing
>> heavy lifting in that case, and getting a good ASIC made that can hit
>> the required performance point for a reasonable compression algorithm
>> like LZ4 or Snappy is insanely cheap once you've gotten past the VLSI
>> work.
> I still thinks it's a myth... The overhead of managing inline
> deduplication is just way too high to implement it without jumping
> through expensive hoops. Most workloads have almost zero deduplication
> potential. And even when, their temporal occurrence is spaced so far
> that an inline deduplicator won't catch it.
>
> If it would be all so easy, btrfs would already have it working in
> mainline. I don't even remember that those patches is still being
> worked on.
>
> With this in mind, I think dup metadata is still a good think to have
> even on SSD and I would always force to enable it.
>
> Potential for deduplication is only when using snapshots (which already
> are deduplicated when taken) or when handling user data on a file
> server in a multi-user environment. Users tend to copy their files all
> over the place - multiple directories of multiple gigabytes. Potential
> is also where you're working with client machine backups or vm images.
> I regularly see deduplication efficiency of 30-60% in such scenarios -
> file servers mostly which I'm handling. But due to temporally far
> spaced occurrence of duplicate blocks, only offline or nearline
> deduplication works here.
>
>>> And the DUP mode is still useful on SSDs, for cases when one copy
>>> of the DUP gets corrupted in-flight due to a bad controller or RAM
>>> or cable, you could then restore that block from its good-CRC DUP
>>> copy.
>> The only window of time during which bad RAM could result in only one
>> copy of a block being bad is after the first copy is written but
>> before the second is, which is usually an insanely small amount of
>> time.  As far as the cabling, the window for errors resulting in a
>> single bad copy of a block is pretty much the same as for RAM, and if
>> they're persistently bad, you're more likely to lose data for other
>> reasons.
> It depends on the design of the software. You're true if this memory
> block is simply a single block throughout its lifetime in RAM before
> written to storage. But if it is already handled as duplicate block in
> memory, odds are different. I hope btrfs is doing this right... ;-)
>
>> That said, I do still feel that DUP mode has value on SSD's.  The
>> primary arguments against it are:
>> 1. It wears out the SSD faster.
> I don't think this is a huge factor, even more when looking at TBW
> capabilities of modern SSDs. And prices are low enough to better swap
> early than waiting for the disaster hitting you. Instead, you can still
> use the old SSD for archival storage (but this has drawbacks, don't
> leave them without power for months or years!) or as a shock resistent
> USB mobile drive on the go.
>
>> 2. The blocks are likely to end up in the same erase block, and
>> therefore there will be no benefit.
> Oh, this is probably a point to really think about... Would ssd_spread
> help here?
>
>> The first argument is accurate, but not usually an issue for most
>> people.  Average life expectancy for a decent SSD is well over 10
>> years, which is more than twice the usual life expectancy for a
>> consumer hard drive.
> Well, my first SSD (128 GB) was worn (according to SMART) after only 12
> months. Bigger drives wear much slower. I now have a 500 GB SSD and
> looking at SMART it projects to serve me well for the next 3-4 years
> or longer. But it will be worn out then. But I'm pretty sure I'll get a
> new drive until then - for performance and space reasons. My high usage
> pattern probably results from using the drives for bcache in write-back
> mode. Btrfs as the bcache user does it's own job (because of CoW) of
> pressing much more data through bcache than normal expectations.
>
>> As far as the second argument against it, that one is partially
>> correct, but ignores an important factor that many people who don't
>> do hardware design (and some who do) don't often consider.  The close
>> temporal proximity of the writes for each copy are likely to mean
>> they end up in the same erase block on the SSD (especially if the SSD
>> has a large write cache).
> Deja vu...
>
>>   However, that doesn't mean that one
>> getting corrupted due to device failure is guaranteed to corrupt the
>> other.  The reason for this is exactly the same reason that single
>> word errors in RAM are exponentially more common than losing a whole
>> chip or the whole memory module: The primary error source is
>> environmental noise (EMI, cosmic rays, quantum interference,
>> background radiation, etc), not system failure.  In other words,
>> you're far more likely to lose a single cell (which is usually not
>> more than a single byte in the MLC flash that gets used in most
>> modern SSD's) in the erase block than the whole erase block.  In that
>> event, you obviously have only got corruption in the particular
>> filesystem block that that particular cell was storing data for.
> Sounds reasonable...
>
>> There's also a third argument for not using DUP on SSD's however:
>> The SSD already does most of the data integrity work itself.
> DUP is really not for integrity but for consistency. If one copy of the
> block becomes damaged for perfectly reasonable instructions sent by the
> OS (from the drive firmware perspective), that block has perfect data
> integrity. But if it was the single copy of a metadata block, your FS
> is probably toast now. In DUP mode you still have the other copy for
> consistent filesystem structures. With this copy, the OS can now restore
> filesystem integrity (which is levels above block level integrity).
>
>


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: dup vs raid1 in single disk
  2017-02-07 22:28       ` Kai Krakow
                           ` (2 preceding siblings ...)
  2017-02-08  9:14         ` Alejandro R. Mosteo
@ 2017-02-08 13:02         ` Austin S. Hemmelgarn
  3 siblings, 0 replies; 10+ messages in thread
From: Austin S. Hemmelgarn @ 2017-02-08 13:02 UTC (permalink / raw)
  To: Kai Krakow, linux-btrfs

On 2017-02-07 17:28, Kai Krakow wrote:
> On Thu, 19 Jan 2017 15:02:14 -0500,
> "Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote:
>
>> On 2017-01-19 13:23, Roman Mamedov wrote:
>>> On Thu, 19 Jan 2017 17:39:37 +0100
>>> "Alejandro R. Mosteo" <alejandro@mosteo.com> wrote:
>>>
>>>> I was wondering, from a point of view of data safety, if there is
>>>> any difference between using dup or making a raid1 from two
>>>> partitions in the same disk. This is thinking on having some
>>>> protection against the typical aging HDD that starts to have bad
>>>> sectors.
>>>
>>> RAID1 will write slower compared to DUP, as any optimization to
>>> make RAID1 devices work in parallel will cause a total performance
>>> disaster for you as you will start trying to write to both
>>> partitions at the same time, turning all linear writes into random
>>> ones, which are about two orders of magnitude slower than linear on
>>> spinning hard drives. DUP shouldn't have this issue, but still it
>>> will be twice slower than single, since you are writing everything
>>> twice.
>> As of right now, there will actually be near zero impact on write
>> performance (or at least, it's way less than the theoretical 50%)
>> because there really isn't any optimization to speak of in the
>> multi-device code.  That will hopefully change over time, but it's
>> not likely to do so any time in the future since nobody appears to be
>> working on multi-device write performance.
>
> I think that's only true if you don't account the seek overhead. In
> single device RAID1 mode you will always seek half of the device while
> writing data, and even when reading between odd and even PIDs. In
> contrast, DUP mode doesn't guarantee your seeks to be shorter but from
> a statistical point of view, on the average it should be shorter. So it
> should yield better performance (tho I wouldn't expect it to be
> observable, depending on your workload).
>
> So, on devices having no seek overhead (aka SSD), it is probably true
> (minus bus bandwidth considerations). For HDD I'd prefer DUP.
>
> From data safety point of view: It's more likely that adjacent
> and nearby sectors are bad. So DUP imposes a higher risk of written
> data being written to only bad sectors - which means data loss or even
> file system loss (if metadata hits this problem).
>
> To be realistic: I wouldn't trade space usage for duplicate data on an
> already failing disk, no matter if it's DUP or RAID1. HDD disk space is
> cheap, and using such a scenario is just waste of performance AND
> space - no matter what. I don't understand the purpose of this. It just
> results in fake safety.
>
> Better get two separate devices half the size. There's a better chance
> of getting a better cost/space ratio anyways, plus better performance
> and safety.
>
>> There's also the fact that you're writing more metadata than data
>> most of the time unless you're dealing with really big files, and
>> metadata is already DUP mode (unless you are using an SSD), so the
>> performance hit isn't 50%, it's actually a bit more than half the
>> ratio of data writes to metadata writes.
>>>
>>>> On a related note, I see this caveat about dup in the manpage:
>>>>
>>>> "For example, a SSD drive can remap the blocks internally to a
>>>> single copy thus deduplicating them. This negates the purpose of
>>>> increased redunancy (sic) and just wastes space"
>>>
>>> That ability is vastly overestimated in the man page. There is no
>>> miracle content-addressable storage system working at 500 MB/sec
>>> speeds all within a little cheap controller on SSDs. Likely most of
>>> what it can do, is just compress simple stuff, such as runs of
>>> zeroes or other repeating byte sequences.
>> Most of those that do in-line compression don't implement it in
>> firmware, they implement it in hardware, and even DEFLATE can get 500
>> MB/second speeds if properly implemented in hardware.  The firmware
>> may control how the hardware works, but it's usually hardware doing
>> heavy lifting in that case, and getting a good ASIC made that can hit
>> the required performance point for a reasonable compression algorithm
>> like LZ4 or Snappy is insanely cheap once you've gotten past the VLSI
>> work.
>
> I still thinks it's a myth... The overhead of managing inline
> deduplication is just way too high to implement it without jumping
> through expensive hoops. Most workloads have almost zero deduplication
> potential. And even when, their temporal occurrence is spaced so far
> that an inline deduplicator won't catch it.
Just like the proposed implementation in BTRFS, it's not complete 
deduplication.  In fact, the only devices I've ever seen that do this 
appear to implement it just like what was proposed for BTRFS, just with 
a much smaller cache.  They were also insanely expensive.
>
> If it would be all so easy, btrfs would already have it working in
> mainline. I don't even remember that those patches is still being
> worked on.
>
> With this in mind, I think dup metadata is still a good think to have
> even on SSD and I would always force to enable it.
Agreed.
>
> Potential for deduplication is only when using snapshots (which already
> are deduplicated when taken) or when handling user data on a file
> server in a multi-user environment. Users tend to copy their files all
> over the place - multiple directories of multiple gigabytes. Potential
> is also where you're working with client machine backups or vm images.
> I regularly see deduplication efficiency of 30-60% in such scenarios -
> file servers mostly which I'm handling. But due to temporally far
> spaced occurrence of duplicate blocks, only offline or nearline
> deduplication works here.
>
>>> And the DUP mode is still useful on SSDs, for cases when one copy
>>> of the DUP gets corrupted in-flight due to a bad controller or RAM
>>> or cable, you could then restore that block from its good-CRC DUP
>>> copy.
>> The only window of time during which bad RAM could result in only one
>> copy of a block being bad is after the first copy is written but
>> before the second is, which is usually an insanely small amount of
>> time.  As far as the cabling, the window for errors resulting in a
>> single bad copy of a block is pretty much the same as for RAM, and if
>> they're persistently bad, you're more likely to lose data for other
>> reasons.
>
> It depends on the design of the software. You're true if this memory
> block is simply a single block throughout its lifetime in RAM before
> written to storage. But if it is already handled as duplicate block in
> memory, odds are different. I hope btrfs is doing this right... ;-)
It's pretty debatable whether or not handling things as duplicates in 
RAM is correct.  Memory has higher error rates than most storage media, 
but it is also much more reasonable to expect it to have good EDAC 
mechanisms than most storage media.
>
>> That said, I do still feel that DUP mode has value on SSD's.  The
>> primary arguments against it are:
>> 1. It wears out the SSD faster.
>
> I don't think this is a huge factor, even more when looking at TBW
> capabilities of modern SSDs. And prices are low enough to better swap
> early than waiting for the disaster hitting you. Instead, you can still
> use the old SSD for archival storage (but this has drawbacks, don't
> leave them without power for months or years!) or as a shock resistent
> USB mobile drive on the go.
>
>> 2. The blocks are likely to end up in the same erase block, and
>> therefore there will be no benefit.
>
> Oh, this is probably a point to really think about... Would ssd_spread
> help here?
Not really; last I knew, the ssd* mount options only affect the chunk 
allocator.
>
>> The first argument is accurate, but not usually an issue for most
>> people.  Average life expectancy for a decent SSD is well over 10
>> years, which is more than twice the usual life expectancy for a
>> consumer hard drive.
>
> Well, my first SSD (128 GB) was worn (according to SMART) after only 12
> months. Bigger drives wear much slower. I now have a 500 GB SSD and
> looking at SMART it projects to serve me well for the next 3-4 years
> or longer. But it will be worn out then. But I'm pretty sure I'll get a
> new drive until then - for performance and space reasons. My high usage
> pattern probably results from using the drives for bcache in write-back
> mode. Btrfs as the bcache user does it's own job (because of CoW) of
> pressing much more data through bcache than normal expectations.
FWIW, the quote I gave (which I didn't properly qualify for some 
reason...) is with respect to the 2 Crucial MX200 SSD's I have in my 
home server system, which is primarily running BOINC apps most of the 
time.  Some brands are of course better than others (Kingston drives, 
for example, seem to have paradoxically short life spans in my experience).
>
>> As far as the second argument against it, that one is partially
>> correct, but ignores an important factor that many people who don't
>> do hardware design (and some who do) don't often consider.  The close
>> temporal proximity of the writes for each copy are likely to mean
>> they end up in the same erase block on the SSD (especially if the SSD
>> has a large write cache).
>
> Deja vu...
>
>>  However, that doesn't mean that one
>> getting corrupted due to device failure is guaranteed to corrupt the
>> other.  The reason for this is exactly the same reason that single
>> word errors in RAM are exponentially more common than losing a whole
>> chip or the whole memory module: The primary error source is
>> environmental noise (EMI, cosmic rays, quantum interference,
>> background radiation, etc), not system failure.  In other words,
>> you're far more likely to lose a single cell (which is usually not
>> more than a single byte in the MLC flash that gets used in most
>> modern SSD's) in the erase block than the whole erase block.  In that
>> event, you obviously have only got corruption in the particular
>> filesystem block that that particular cell was storing data for.
>
> Sounds reasonable...
>
>> There's also a third argument for not using DUP on SSD's however:
>> The SSD already does most of the data integrity work itself.
>
> DUP is really not for integrity but for consistency. If one copy of the
> block becomes damaged for perfectly reasonable instructions sent by the
> OS (from the drive firmware perspective), that block has perfect data
> integrity. But if it was the single copy of a metadata block, your FS
> is probably toast now. In DUP mode you still have the other copy for
> consistent filesystem structures. With this copy, the OS can now restore
> filesystem integrity (which is levels above block level integrity).
>
That's still data integrity from the filesystem and userspace's perspective.


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2017-02-08 13:13 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CACNDjuzntG5Saq5HHNeDUmq-=28riKAerkO=CD=zAW-QofbKSg@mail.gmail.com>
2017-01-19 16:39 ` Fwd: dup vs raid1 in single disk Alejandro R. Mosteo
2017-01-19 17:06   ` Austin S. Hemmelgarn
2017-01-19 18:23   ` Roman Mamedov
2017-01-19 20:02     ` Austin S. Hemmelgarn
2017-01-21 16:00       ` Alejandro R. Mosteo
2017-02-07 22:28       ` Kai Krakow
2017-02-07 22:46         ` Hans van Kranenburg
2017-02-08  0:39         ` Dan Mons
2017-02-08  9:14         ` Alejandro R. Mosteo
2017-02-08 13:02         ` Austin S. Hemmelgarn
