All of lore.kernel.org
 help / color / mirror / Atom feed
* Feature Req: "mkfs.btrfs -d dup" option on single device
@ 2013-12-10 20:31 Imran Geriskovan
  2013-12-10 22:41 ` Chris Murphy
  0 siblings, 1 reply; 22+ messages in thread
From: Imran Geriskovan @ 2013-12-10 20:31 UTC (permalink / raw)
  To: linux-btrfs

Currently, if you want to protect your data against bit-rot on
a single device you must have 2 btrfs partitions and mount
them as RAID1. The requested option will save the user from
partitioning and will provide flexibility.
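
For reference, the two-partition workaround looks roughly like this (just a
sketch; the partition names are placeholders for two partitions on one disk):

mkfs.btrfs -d raid1 -m raid1 /dev/sdX1 /dev/sdX2
btrfs device scan                # let the kernel see both members
mount /dev/sdX1 /mnt             # mounting either member mounts the pair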

Yes, I know: this will not provide any safety against hardware
failure. But that is not the purpose anyway.

The main purpose is to "Ensure Data Integrity" on:
a- Computers (e.g. laptops) where hardware RAID is not practical.
b- Backup sets (e.g. USB drives) where hardware RAID is overkill.

Even if you have regular backups, without "Guaranteed
Data Integrity" on all data sets you will lose some data
someday, somewhere.

See discussion at:
http://hardware.slashdot.org/story/13/12/10/178234/ask-slashdot-practical-bitrot-detection-for-backups


Now, the futuristic and OPTIONAL part for the sufficiently paranoid:
The number of duplicates may be parametric:

mkfs.btrfs -m dup 4 -d dup 3 ... (4 duplicates for metadata, 3
duplicates for data)

I kindly request your comments. (At least for "-d dup")

Regards,
Imran Geriskovan

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Feature Req: "mkfs.btrfs -d dup" option on single device
  2013-12-10 20:31 Feature Req: "mkfs.btrfs -d dup" option on single device Imran Geriskovan
@ 2013-12-10 22:41 ` Chris Murphy
  2013-12-10 23:33   ` Imran Geriskovan
  0 siblings, 1 reply; 22+ messages in thread
From: Chris Murphy @ 2013-12-10 22:41 UTC (permalink / raw)
  To: Imran Geriskovan; +Cc: linux-btrfs


On Dec 10, 2013, at 1:31 PM, Imran Geriskovan <imran.geriskovan@gmail.com> wrote:

> Currently, if you want to protect your data against bit-rot on
> a single device you must have 2 btrfs partitions and mount
> them as RAID1.

No this also works:

mkfs.btrfs -d dup -m dup -M <device>
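
To double-check what you end up with (the mount point is just an example, and
the exact output wording may vary between btrfs-progs versions):

mount <device> /mnt
btrfs filesystem df /mnt    # the mixed Data+Metadata chunks should show as DUP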


Chris Murphy


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Feature Req: "mkfs.btrfs -d dup" option on single device
  2013-12-10 22:41 ` Chris Murphy
@ 2013-12-10 23:33   ` Imran Geriskovan
  2013-12-10 23:40     ` Chris Murphy
  0 siblings, 1 reply; 22+ messages in thread
From: Imran Geriskovan @ 2013-12-10 23:33 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-btrfs

>> Currently, if you want to protect your data against bit-rot on
>> a single device you must have 2 btrfs partitions and mount
>> them as RAID1.

> No this also works:
> mkfs.btrfs -d dup -m dup -M <device>

Thanks a lot.

I guess docs need an update:

https://btrfs.wiki.kernel.org/index.php/Mkfs.btrfs:
"-d": Data profile, values like metadata. EXCEPT DUP CANNOT BE USED

man mkfs.btrfs (btrfs-tools 0.19+20130705)
-d, --data type
              Specify  how  the data must be spanned across
              the devices specified. Valid values are raid0, raid1,
              raid5,  raid6,  raid10  or single.

Imran

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Feature Req: "mkfs.btrfs -d dup" option on single device
  2013-12-10 23:33   ` Imran Geriskovan
@ 2013-12-10 23:40     ` Chris Murphy
       [not found]       ` <CAK5rZE6DVC5kYAU68oCjjzGPS4B=nRhOzATGM-5=m1_bW4GG6g@mail.gmail.com>
  0 siblings, 1 reply; 22+ messages in thread
From: Chris Murphy @ 2013-12-10 23:40 UTC (permalink / raw)
  To: Imran Geriskovan; +Cc: linux-btrfs


On Dec 10, 2013, at 4:33 PM, Imran Geriskovan <imran.geriskovan@gmail.com> wrote:

>>> Currently, if you want to protect your data against bit-rot on
>>> a single device you must have 2 btrfs partitions and mount
>>> them as RAID1.
> 
>> No this also works:
>> mkfs.btrfs -d dup -m dup -M <device>
> 
> Thanks a lot.
> 
> I guess docs need an update:
> 
> https://btrfs.wiki.kernel.org/index.php/Mkfs.btrfs:
> "-d": Data profile, values like metadata. EXCEPT DUP CANNOT BE USED
> 
> man mkfs.btrfs (btrfs-tools 0.19+20130705)
> -d, --data type
>              Specify  how  the data must be spanned across
>              the devices specified. Valid values are raid0, raid1,
>              raid5,  raid6,  raid10  or single.

Current btrfs-progs is v3.12. 0.19 is a bit old. But yes, looks like the wiki also needs updating. 

Anyway I just tried it on an 8GB stick and it works, but -M (mixed data+metadata) is required, which documentation also says incurs a performance hit, although I'm uncertain of the significance.


Chris Murphy

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Fwd: Feature Req: "mkfs.btrfs -d dup" option on single device
       [not found]       ` <CAK5rZE6DVC5kYAU68oCjjzGPS4B=nRhOzATGM-5=m1_bW4GG6g@mail.gmail.com>
@ 2013-12-11  0:17         ` Imran Geriskovan
  2013-12-11  0:33         ` Chris Murphy
  1 sibling, 0 replies; 22+ messages in thread
From: Imran Geriskovan @ 2013-12-11  0:17 UTC (permalink / raw)
  To: linux-btrfs; +Cc: lists

---------- Forwarded message ----------
From: Imran Geriskovan <imran.geriskovan@gmail.com>
Date: Wed, 11 Dec 2013 02:14:25 +0200
Subject: Re: Feature Req: "mkfs.btrfs -d dup" option on single device
To: Chris Murphy <lists@colorremedies.com>

> Current btrfs-progs is v3.12. 0.19 is a bit old. But yes, looks like the
> wiki also needs updating.

> Anyway I just tried it on an 8GB stick and it works, but -M (mixed
> data+metadata) is required, which documentation also says incurs a
> performance hit, although I'm uncertain of the significance.

btrfs-tools 0.19+20130705 is the most recent one on Debian's
leading edge Sid/Unstable.

Given the state of the docs, probably very few or no people have ever used
'-d dup'. As the lead developer, is it possible for you to
provide some insight into the reliability of this option?

Can the '-M' requirement be an indication of code which has not been
ironed out, or is it simply a constraint of the internal machinery?

How well do the main idea of "Guaranteed Data Integrity
for extra reliability" and the option "-d dup" in its current state match?

Regards,
Imran

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Feature Req: "mkfs.btrfs -d dup" option on single device
       [not found]       ` <CAK5rZE6DVC5kYAU68oCjjzGPS4B=nRhOzATGM-5=m1_bW4GG6g@mail.gmail.com>
  2013-12-11  0:17         ` Fwd: " Imran Geriskovan
@ 2013-12-11  0:33         ` Chris Murphy
  2013-12-11  3:19           ` Imran Geriskovan
                             ` (2 more replies)
  1 sibling, 3 replies; 22+ messages in thread
From: Chris Murphy @ 2013-12-11  0:33 UTC (permalink / raw)
  To: Imran Geriskovan; +Cc: Btrfs BTRFS


On Dec 10, 2013, at 5:14 PM, Imran Geriskovan <imran.geriskovan@gmail.com> wrote:

>> Current btrfs-progs is v3.12. 0.19 is a bit old. But yes, looks like the
>> wiki also needs updating.
> 
>> Anyway I just tried it on an 8GB stick and it works, but -M (mixed
>> data+metadata) is required, which documentation also says incurs a
>> performance hit, although I'm uncertain of the significance.
> 
> btrfs-tools 0.19+20130705 is the most recent one on Debian's
> leading edge Sid/Unstable.
> 
> Given the state of the docs, probably very few or no people have ever used
> '-d dup'. As the lead developer, is it possible for you to
> provide some insight into the reliability of this option?

I'm not a developer, I'm just an ape who wears pants. Chris Mason is the lead developer. All I can say about it is that it's been working for me OK so far.

> Can the '-M' requirement be an indication of code which has not been
> ironed out, or is it simply a constraint of the internal machinery?

I think it's just how chunks are allocated: it becomes space inefficient to have two separate metadata and data chunks, hence the requirement to mix them if -d dup is used. But I'm not really sure.

> How well do the main idea of "Guaranteed Data Integrity
> for extra reliability" and the option "-d dup" in its current state match?

Well given that Btrfs is still flagged as experimental, most notably when creating any Btrfs file system, I'd say that doesn't apply here. If the case you're trying to mitigate is some kind of corruption that can only be repaired if you have at least one other copy of data, then -d dup is useful. But obviously this ignores the statistically greater chance of a more significant hardware failure, as this is still single device. Not only could the entire single device fail, but it's possible that erase blocks individually fail. And since the FTL decides where pages are stored, the duplicate data/metadata copies could be stored in the same erase block. So there is a failure vector other than full failure where some data can still be lost on a single device even with duplicate, or triplicate copies.


Chris Murphy

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Feature Req: "mkfs.btrfs -d dup" option on single device
  2013-12-11  0:33         ` Chris Murphy
@ 2013-12-11  3:19           ` Imran Geriskovan
  2013-12-11  4:07             ` Chris Murphy
  2013-12-11 14:07             ` Martin
  2013-12-11  7:39           ` Feature Req: " Duncan
  2013-12-11 10:56           ` Duncan
  2 siblings, 2 replies; 22+ messages in thread
From: Imran Geriskovan @ 2013-12-11  3:19 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

> I'm not a developer, I'm just an ape who wears pants. Chris Mason is the
> lead developer. All I can say about it is that it's been working for me OK
> so far.

Great :) Now I understand that you were using "-d dup", which is quite
valuable for me. And since Gmail only shows first names in the inbox list,
I thought you were Chris Mason. Sorry. Now I see your full name
in the header.


>> Can the '-M' requirement be an indication of code which has not been
>> ironed out, or is it simply a constraint of the internal machinery?

> I think it's just how chunks are allocated: it becomes space inefficient to
> have two separate metadata and data chunks, hence the requirement to mix
> them if -d dup is used. But I'm not really sure.


Sounds like it is implemented in parallel with / similarly to "-m dup"; that's
why "-M" is implied. Of course, we are speculating here...

Now the question is, is it a good practice to use "-M" for large filesystems?
Pros, Cons? What is the performance impact? Or any other possible impact?


> Well given that Btrfs is still flagged as experimental, most notably when
> creating any Btrfs file system, I'd say that doesn't apply here. If the case
> you're trying to mitigate is some kind of corruption that can only be
> repaired if you have at least one other copy of data, then -d dup is useful.
> But obviously this ignores the statistically greater chance of a more
> significant hardware failure, as this is still single device.


From the beginning we've put the possibility of full hardware failure aside.
The user is expected to handle that risk elsewhere.

Our scope is localized failures which may cost you
some files. Since btrfs has checksums, you can at least be aware of them.
Using "-d dup" we increase our chances of recovering from them.

But the probability of corruption of all duplicates is non-zero.
Hence checking the output of "btrfs scrub start <path>" is beneficial
before making/updating any backups. And then check the output of the
scrub on the backup too.
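
Concretely, I mean something like this (the paths are only examples):

btrfs scrub start -B /data          # -B waits and prints the error counters
btrfs scrub status /data            # or check later, if started without -B
# if no uncorrectable errors are reported, refresh the backup, then
btrfs scrub start -B /mnt/backup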


> Not only could
> the entire single device fail, but it's possible that erase blocks
> individually fail. And since the FTL decides where pages are stored, the
> duplicate data/metadata copies could be stored in the same erase block. So
> there is a failure vector other than full failure where some data can still
> be lost on a single device even with duplicate, or triplicate copies.


I guess you are talking about SSDs. Even if you write duplicates
on distinct erase blocks, they may end up in the same block after
firmware's relocation, defragmentation, migration, remapping,
god knows what ...ation operations. So practically, a block
address does not point to any fixed physical location on SSDs.

What's more (in relation to our long-term data integrity aim),
the order of magnitude of their unpowered data retention period is
1 YEAR. (Read it as 6 months to 2-3 years. While powered they
refresh/shuffle the blocks.) This makes SSDs
unsuitable for mid-to-long-term consumer storage. Hence they are
out of this discussion. (By the way, the only way to get reliable
duplication on SSDs is to use physically separate devices.)

Luckily we still have hard drives with sensible block addressing,
even with bad block relocation. So duplication, triplication, ... still
makes sense. Or DOES IT? Comments?

e.g. The new Advanced Format drives may employ 4K blocks
but present 512B logical blocks, which may be another reincarnation
of the SSD problem above. However, I guess the Linux kernel does not
access such drives using logical addressing.

Imran

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Feature Req: "mkfs.btrfs -d dup" option on single device
  2013-12-11  3:19           ` Imran Geriskovan
@ 2013-12-11  4:07             ` Chris Murphy
  2013-12-11  8:09               ` Hugo Mills
  2013-12-11 14:07             ` Martin
  1 sibling, 1 reply; 22+ messages in thread
From: Chris Murphy @ 2013-12-11  4:07 UTC (permalink / raw)
  To: Imran Geriskovan; +Cc: Btrfs BTRFS


On Dec 10, 2013, at 8:19 PM, Imran Geriskovan <imran.geriskovan@gmail.com> wrote:
> 
> Now the question is, is it a good practice to use "-M" for large filesystems?
> Pros, Cons? What is the performance impact? Or any other possible impact?

Uncertain. man mkfs.btrfs says "Mix data and metadata chunks together for more efficient space utilization.  This feature incurs a performance penalty in larger filesystems.  It is recommended for use with filesystems of 1 GiB or smaller."

I haven't benchmarked to quantify the penalty.
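
If someone wants a rough first number, something like this would at least show
the sequential I/O side (the scratch device and mount point are placeholders,
and this is nowhere near a rigorous benchmark):

mkfs.btrfs -M -d dup -m dup /dev/sdX && mount /dev/sdX /mnt
dd if=/dev/zero of=/mnt/testfile bs=1M count=4096 conv=fsync   # write
echo 3 > /proc/sys/vm/drop_caches
dd if=/mnt/testfile of=/dev/null bs=1M                         # cold read
# then repeat on a plain "mkfs.btrfs /dev/sdX" filesystem and compare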

> I guess you are talking about SSDs. Even if you write duplicates
> on distinct erase blocks, they may end up in the same block after
> firmware's relocation, defragmentation, migration, remapping,
> god knows what ...ation operations. So practically, a block
> address does not point to any fixed physical location on SSDs.

Yes, SSDs, although it seems that any flash media could behave this way, as it's up to the manufacturer's firmware how it ends up behaving.

> Luckily we still have hard drives with sensible block addressing,
> even with bad block relocation.

Seagate has said they've already shipped 1 million shingled magnetic recording (SMR) hard drives. I don't know what sort of "FTL-like" behavior they implement, but it stands to reason that since the file system doesn't know which LBAs translate into physical sectors that are part of a layered band, and which LBAs are suited for random IO, the drive might be capable of figuring this out. Random-IO LBAs go to physical sectors suited for this, and sequential writes go to bands.

> e.g. The new Advanced Format drives may employ 4K blocks
> but present 512B logical blocks, which may be another reincarnation
> of the SSD problem above. However, I guess the Linux kernel does not
> access such drives using logical addressing.

It does, absolutely. All drives are accessed by LBA these days. And Linux does fine with both varieties of AF disks: 512e and 4Kn.

Offhand, I think the only issue is that pretty much no BIOS firmware will boot from a drive with 4K logical/physical sectors, the so-called 4Kn drives that do not present 512-byte sectors. And since UEFI bugs are all over the place, I'd kinda expect booting to work with some firmware and not others. I haven't tested it, but I'm pretty sure I've read that GRUB2 and the kernel are able to boot from 4Kn drives so long as the firmware can handle it.


Chris Murphy

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Feature Req: "mkfs.btrfs -d dup" option on single device
  2013-12-11  0:33         ` Chris Murphy
  2013-12-11  3:19           ` Imran Geriskovan
@ 2013-12-11  7:39           ` Duncan
  2013-12-11 10:56           ` Duncan
  2 siblings, 0 replies; 22+ messages in thread
From: Duncan @ 2013-12-11  7:39 UTC (permalink / raw)
  To: linux-btrfs

Chris Murphy posted on Tue, 10 Dec 2013 17:33:59 -0700 as excerpted:

> On Dec 10, 2013, at 5:14 PM, Imran Geriskovan
> <imran.geriskovan@gmail.com> wrote:
> 
>> As the lead developer, is it possible for you to
>> provide some insight into the reliability of this option?
> 
> I'm not a developer, I'm just an ape who wears pants. Chris Mason is the
> lead developer.

Lest anyone stumbling across this in a google or the like think 
otherwise, it's probably worthwhile to pull this bit out in its own post.

Chris Mason: btrfs lead dev.

Chris Murphy: not a dev, just a btrfs tester/user and btrfs list regular.

It's worth noting that this isn't a problem unique to this list or the 
Chris name.  There are I think three "Linus" on LKML now, with at least 
Linus W noting in good humor at one point that his posts do tend to get 
noticed a bit more due to his first name, even if he's not /the/ Linus, 
Torvalds, that is.

I've name-collided with other "John Duncan"s too.  Which is one reason 
I've been a mononym "Duncan" for over a decade now.  Strangely enough, 
I've had far fewer issues with Duncan as a mononym, than I did with "John 
Duncan".  I guess at least the "Duncan" mononym is rarer than the "John 
Duncan" name.

@ C. Murphy:  Given the situation, you might consider a "not a dev, just 
a user" disclaimer in your sig...  Or perhaps you'll find a Murphy mononym 
as useful as I have the Duncan mononym, altho I guess that has the 
Murphy's Law connotation, but that might have the effect of keeping the 
namespace even clearer (or not, I don't know), as well as making the 
mononym rather more memorable!

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Feature Req: "mkfs.btrfs -d dup" option on single device
  2013-12-11  4:07             ` Chris Murphy
@ 2013-12-11  8:09               ` Hugo Mills
  2013-12-11 16:15                 ` Chris Murphy
  2013-12-11 17:46                 ` Duncan
  0 siblings, 2 replies; 22+ messages in thread
From: Hugo Mills @ 2013-12-11  8:09 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Imran Geriskovan, Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 1185 bytes --]

On Tue, Dec 10, 2013 at 09:07:21PM -0700, Chris Murphy wrote:
> 
> On Dec 10, 2013, at 8:19 PM, Imran Geriskovan <imran.geriskovan@gmail.com> wrote:
> > 
> > Now the question is, is it a good practice to use "-M" for large filesystems?
> > Pros, Cons? What is the performance impact? Or any other possible impact?
> 
> Uncertain. man mkfs.btrfs says "Mix data and metadata chunks together for more efficient space utilization.  This feature incurs a performance penalty in larger filesystems.  It is recommended for use with filesystems of 1 GiB or smaller."

   That documentation needs tweaking. You need --mixed/-M for larger
filesystems than that. It's hard to say exactly where the optimal
boundary is, but somewhere around 16 GiB seems to be the dividing
point (8 GiB is in the "mostly going to cause you problems without it"
area). 16 GiB is what we have on the wiki, I think.

> I haven't benchmarked to quantify the penalty.

   Nor have I.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
                      --- vi: The core of evil. ---                      

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Feature Req: "mkfs.btrfs -d dup" option on single device
  2013-12-11  0:33         ` Chris Murphy
  2013-12-11  3:19           ` Imran Geriskovan
  2013-12-11  7:39           ` Feature Req: " Duncan
@ 2013-12-11 10:56           ` Duncan
  2013-12-11 13:19             ` Imran Geriskovan
  2 siblings, 1 reply; 22+ messages in thread
From: Duncan @ 2013-12-11 10:56 UTC (permalink / raw)
  To: linux-btrfs

Chris Murphy posted on Tue, 10 Dec 2013 17:33:59 -0700 as excerpted:

> On Dec 10, 2013, at 5:14 PM, Imran Geriskovan
> <imran.geriskovan@gmail.com> wrote:
> 
>>> Current btrfs-progs is v3.12. 0.19 is a bit old. But yes, looks like
>>> the wiki also needs updating.
>> 
>>> Anyway I just tried it on an 8GB stick and it works, but -M (mixed
>>> data+metadata) is required, which documentation also says incurs a
>>> performance hit, although I'm uncertain of the significance.
>> 
>> btrfs-tools 0.19+20130705 is the most recent one on Debian's leading
>> edge Sid/Unstable.

[I was debating where to reply, and chose here.]

To be fair, that's a snapshot date tag, 0.19 plus 2013-07-05 (which would 
be the date the git snapshot was taken), which isn't /that/ old, 
particularly for something like Debian.  There was a 0.20-rc1 about this 
time last year (Nov/Dec-ish 2012), but I guess Debian's date tags don't 
take rcs into account.

That said, as the wiki states, btrfs is still under /heavy/ development, 
and anyone using it at this point is by definition a development filesystem 
tester.  Said testers are strongly recommended to keep current with both 
the kernel and the btrfs-progs userspace, both because not doing so 
unnecessarily exposes whatever they're testing to already known and fixed 
bugs, and because if things /do/ go wrong, reports from outdated versions 
are little more than distracting noise if the bug is already fixed, and 
simply aren't as useful if it remains unfixed.

The btrfs-progs git repo policy is that the master branch is always kept 
release-ready; development is done on other branches and merged to master 
only when considered release-ready.  Ideally, then, all testers would run 
a current git build, either built themselves or, for distros that choose 
to package a development/testing product like btrfs, built and updated by 
the distro on a weekly or monthly basis.  Of course that flies in the 
face of normal distro stabilization policies, but the point is, btrfs is 
NOT a normal distro-stable package, and distros that choose to package it 
are by definition choosing to package a development package for their 
users to test /as/ a development package, and should update it 
accordingly.

And Debian or not Debian, a development status package last updated in 
July, when it's now December and there have been significant changes 
since July... might not be /that/ old in Debian terms, but it certainly 
isn't current, either!

>> Given the state of the docs, probably very few or no people have ever used
>> '-d dup'. As the lead developer, is it possible for you to
>> provide some insight into the reliability of this option?
> 
> I'm not a developer, I'm just an ape who wears pants. Chris Mason is the
> lead developer. All I can say about it is that it's been working for me
> OK so far.

Lest anyone finding this thread in google or the like think otherwise, 
it's probably worthwhile to emphasize that with a separate post... which 
I just did.

>> Can the '-M' requirement be an indication of code which has not been ironed
>> out, or is it simply a constraint of the internal machinery?
> 
> I think it's just how chunks are allocated: it becomes space inefficient
> to have two separate metadata and data chunks, hence the requirement to
> mix them if -d dup is used. But I'm not really sure.

AFAIK, duplicated data without RAID simply wasn't considered a reasonable 
use-case.  I'd certainly consider it so here, in particular because I 
*DO* use the data integrity and scrub features, but I'm actually using 
dual physical devices (SSDs in my case) in raid1 mode, instead.

The fact that mixed-data/metadata mode allows it is thus somewhat of an 
accident, more than a planned feature.  FWIW I had tried btrfs some time 
ago, then left as I decided it wasn't mature enough for my use-case at 
the time, and just came back in time to see mixed-mode going in.  Until 
mixed-mode, btrfs had quite some issues on 1-gig or smaller partitions as 
the pre-allocated separate data and metadata blocks simply didn't tend to 
balance out that well and one or the other would tend to be used up very 
fast, leaving the filesystem more or less useless in terms of further 
writes.  Mixed-data/metadata mode was added as an effective way of 
countering that problem, and in fact I've been quite pleased with how it 
has worked here on my smaller partitions.

My /boot is 256 MiB.  I have one of those in dup-mode (meaning both data 
and metadata dup, since it's mixed-mode) on each of my otherwise btrfs 
raid1-mode SSDs, allowing me to select and boot either one from the BIOS.  
That gives an effective backup of something that would otherwise not be 
easily and effectively backup-able, since bootloaders tend to allow 
pointing at only one such location.  (Tho with grub2 on GPT-partitioned 
devices with a BIOS reserved partition for grub2, that's not the issue it 
tended to be on MBR: grub2 should still come up with its rescue-mode 
shell even if it can't find the /boot it'd normally load the normal shell 
from, and the rescue-mode shell can be used to point at a different 
/boot.  But then the same question applies to the grub installed on that 
BIOS partition, for which a second device with its own grub2 installed to 
its own BIOS partition is still the best backup.)  My /var/log is 640 MiB 
mixed-mode too, but in btrfs raid1, with the size, 640 MiB, chosen as 
about half a GiB rounded up a bit, and positioned such that all later 
partitions on the device start at an even GiB boundary.

In fact, I only recently realized the DUP-mode implications of mixed-mode 
on the /boot partitions myself, when I went to scrub them and then 
thought "Oh, but they're not raid1, so scrub won't work on the data."  
Except that it did, because the mixed-mode made the data as well as the 
metadata DUP-mode.
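
For anyone wanting the same sort of setup, the small mixed-dup filesystems 
here amount to something along these lines (device, label and mount point 
are examples, of course):

mkfs.btrfs -M -m dup -d dup -L boot0 /dev/sdX1
mount /dev/sdX1 /boot
btrfs filesystem df /boot      # mixed chunks, so the data rides along as DUP
btrfs scrub start -B /boot     # scrub can now repair data, not just metadata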

>> How well do the main idea of "Guaranteed Data Integrity for extra
>> reliability" and the option "-d dup" in its current state match?
> 
> Well given that Btrfs is still flagged as experimental, most notably
> when creating any Btrfs file system, I'd say that doesn't apply here.

Good point! =:^)

Tho the "experimental" level was recently officially toned down a notch, 
with a change to the kernel's btrfs option description that now says the 
btrfs on-device format is considered reasonably stable and will change 
only if absolutely necessary, and then only in such a way that new 
versions will remain able to mount old-device-format filesystems.  But 
it's still not a fully complete, well-tested filesystem, and it 
remains under very heavy development, with fixes in every kernel series.

Meanwhile, it can be pointed out that there's currently no built-in way 
to access data that fails its checksum -- currently, if there's no valid 
second copy around due to raid1 or dup mode (and it can be noted, there's 
ONLY one additional copy ATM, no way to add further redundancy, tho N-way 
mirroring is planned after raid5/6, the currently focused in-development 
feature, is completed), or if that second copy fails its checksum as 
well, you're SOL.

That's guaranteed data integrity.  If the data can be accessed, its 
integrity is guaranteed due to the checksums.  If they fail, the data 
simply can no longer be accessed.  (Of course there's the nodatasum and 
nodatacow mount options which turn that off, and the NOCOW file 
attribute, which I believe turns off checksumming as well, and those are 
recommended for large-and-frequently-internally-written-file use-cases 
such as VM images, but those aren't the defaults.)

But while that might be guaranteed integrity, it's definitely NOT "extra 
reliability", at least in the actual accessible data sense, if you can't 
access the data AT ALL without a second copy around, which isn't 
available on a single device without data-dup-mode.

That was one reason I went multi-device and btrfs raid1 mode.  And I'd be 
much more comfortable if that N-way-mirroring feature was currently 
available and working as well.  I'd probably limit it to three-way, but I 
would definitely rest more comfortably with that three-way!

But, given btrfs' development status and thus the limits to trusting any 
such feature ATM, I think we're thinking future-stable-btrfs as much as 
current-development btrfs.  Three-way is definitely planned, and I agree 
with the premise of the original post as well, that there's a definite 
use-case for DUP-mode (and even TRIPL-mode, etc) on a single partition, 
too.

> If the case you're trying to mitigate is some kind of corruption that
> can only be repaired if you have at least one other copy of data, then
> -d dup is useful. But obviously this ignores the statistically greater
> chance of a more significant hardware failure, as this is still single
> device.

I'd take issue with the "statistically greater" assumption you make.  
Perhaps in theory, and arguably possibly in UPS-backed always-on 
scenarios as well, but I've had personal experience with failed checksums 
and subsequent scrubs here on my raid1 mode btrfs, that were NOT hardware 
faults, on quite new SSDs that I'd be VERY unhappy with if they /did/ 
actually start generating hardware faults.

In my case it's a variant of the unsafe shutdown scenario.  In 
particular, my SSDs take a bit to stabilize after first turn-on, and one 
typically appears and is ready to take commands some seconds before the 
other one.  Now the kernel does have the rootwait commandline option to 
wait for devices to appear, and between that and the initramfs I have to 
use in order for a btrfs raid1-mode rootfs to mount properly 
(apparently rootflags=device=/dev/whatever,device=/dev/whatever2 doesn't 
parse properly, I'd guess due to splitting at the wrong equals, and 
without an initramfs I have to mount degraded, or at least I did a few 
kernels ago when I set things up), actual bootup works fine.

But suspend2ram apparently doesn't use the same rootwait mechanism, and 
if I leave my system in suspend2ram for more than a few hours (I'd guess 
whatever it takes for the SSDs' caps to drain sufficiently so it takes too 
long to stabilize again), when I try to resume, one of the devices will 
appear first and the system will try to resume with only it, without the 
other device having shown up yet.

Unfortunately, btrfs raid1 mode doesn't yet cope particularly well with 
runtime (well here, resume-time) device-loss, and open-for-write files 
such as ~/.xsession-errors and /var/log/* start triggering errors almost 
immediately after resume, forcing the filesystems read-only and forcing 
an only semi-graceful reboot without properly closing those still-open-
for-writing-but-can't-be-written files.

Fortunately, those files are on independent btrfs non-root filesystems, 
and my (also btrfs) rootfs remains mounted read-only in normal operation, 
so there's very little chance of damage to the core system on the 
rootfs.  Only /home and /var/log are normally mounted writable (and the 
tmpfs-based /tmp, /run... of course, /var/lib and a few other things that 
need to be writable and retained over a reboot are symlinked to subdirs 
in /home).  And the writable filesystems have always remained bootable; 
they just have errors due to the abrupt device-drop and subsequent forced-
read-only of the remaining device with open-for-write files.

A scrub normally fixes them, altho in one case recently, it "fixed" both 
my user's .xsession-errors and .bash_history files to complete 
unreadability -- any attempt to read either one, even with cat, would 
lock up userspace (magic-sysrq would work, so the kernel wasn't entirely 
dead, but no userspace output). So scrub didn't save the files that time, 
even if it did apparently fix the metadata.  I couldn't log in, even in a 
non-X VT, as that user, until I deleted .bash_history.  And once I 
deleted the user's .bash_history, I could log in non-X, but attempting to 
startx would again give me an unresponsive userspace, until I 
deleted .xsession-errors as well.

Needless to say, I've quit using suspend2ram for anything that might be 
longer than, say, four hours.  Unfortunately, suspend2disk aka hibernate 
didn't work on this machine last I tried it (it hibernated but resume 
would fail, tho that was when I first set up the machine over a year ago 
now; I really should try it again with a current kernel...), so all I 
have is reboot.  Tho with SSDs for the main system that's not so bad.  
And with it being winter here, the heat from a running system isn't 
entirely wasted, so for now I can leave the system on when I'd normally 
suspend2ram it, unlike during the 8-9 months of the year when I'm paying 
for any computer energy twice, once to use it, and again to pump the heat 
outside with the AC, here in Phoenix.

So the point of all that... data corruption isn't necessarily rarer than 
single-device hardware failure at all.  (Obviously in my case the fact 
that it's dual-devices in btrfs raid1 mode was a big part of the trigger; 
that wouldn't apply in single-device cases.)  But there are other real-
world corruption cases too, including simple ungraceful shutdowns that 
could well trigger the same sort of issues on a single device, and that 
for a LOT of people are far more likely than hardware storage device 
failure.

So there's a definite use-case for single-device DUP/TRIPL/... mode, 
particularly so since that's what's required to actually make practical 
use of scrub and thus the actual available reliability side of the btrfs 
data integrity feature.

> Not only could the entire single device fail, but it's possible
> that erase blocks individually fail. And since the FTL decides where
> pages are stored, the duplicate data/metadata copies could be stored in
> the same erase block. So there is a failure vector other than full
> failure where some data can still be lost on a single device even with
> duplicate, or triplicate copies.

That's actually the reason btrfs defaults to SINGLE metadata mode on 
single-device SSD-backed filesystems, as well.

But as Imran points out, SSDs aren't all there is.  There's still 
spinning rust around.

And defaults aside, even on SSDs it should be /possible/ to specify data-
dup mode, because there's enough different SSD variants and enough 
different use-cases, that it's surely going to be useful some-of-the-time 
to someone. =:^)

And btrfs still being in development means it's a good time to make the 
request, before it's stabilized without data-dup mode, and possibly 
without the ability to easily add it because nobody thought the case was 
viable and it thus wasn't planned for before btrfs went stable. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Feature Req: "mkfs.btrfs -d dup" option on single device
  2013-12-11 10:56           ` Duncan
@ 2013-12-11 13:19             ` Imran Geriskovan
  2013-12-11 18:27               ` Duncan
  0 siblings, 1 reply; 22+ messages in thread
From: Imran Geriskovan @ 2013-12-11 13:19 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Duncan

> That's actually the reason btrfs defaults to SINGLE metadata mode on
> single-device SSD-backed filesystems, as well.
>
> But as Imran points out, SSDs aren't all there is.  There's still
> spinning rust around.
>
> And defaults aside, even on SSDs it should be /possible/ to specify data-
> dup mode, because there's enough different SSD variants and enough
> different use-cases, that it's surely going to be useful some-of-the-time
> to someone. =:^)

We didn't start with SSDs, but the thread is heading there. Well, OK then.

Hard drives with more complex firmware, hybrids, and so on
are becoming available. Eventually they will share common problems with
SSDs.

To make a long story short, let's say "Eventually we will all have
block-addressed devices without any sensible, physically bound addresses."

Without physically bound addresses, any duplicate written to the device MAY
end up in the same unreliable portion of the device. Note it says "MAY". However,
the devices are so large that this probability is very low. The paranoid who
want to make it lower may simply increase the number of duplicates.

On the other hand, people who work with multiple physical devices
may want to decrease the number of duplicates (probably to a single copy).

Hence, there is definitely a use case for a tunable number of duplicates,
for both data and metadata.

Now, there is one open issue:
In its current form "-d dup" requires "-M". Is that a constraint of the
design, or an arbitrary/temporary one? What will the situation be if there
are tunable duplicates?

And more:
Is "-M" good for everyday usage on large fs for efficient packing?
What's the penalty? Can it be cured? If so, why not make it the default?

Imran

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Feature Req: "mkfs.btrfs -d dup" option on single device
  2013-12-11  3:19           ` Imran Geriskovan
  2013-12-11  4:07             ` Chris Murphy
@ 2013-12-11 14:07             ` Martin
  2013-12-11 15:31               ` Imran Geriskovan
  2013-12-11 23:32               ` SSD data retention, was: " Chris Murphy
  1 sibling, 2 replies; 22+ messages in thread
From: Martin @ 2013-12-11 14:07 UTC (permalink / raw)
  To: linux-btrfs

On 11/12/13 03:19, Imran Geriskovan wrote:

SSDs:

> What's more (in relation to our long-term data integrity aim),
> the order of magnitude of their unpowered data retention period is
> 1 YEAR. (Read it as 6 months to 2-3 years. While powered they
> refresh/shuffle the blocks.) This makes SSDs
> unsuitable for mid-to-long-term consumer storage. Hence they are
> out of this discussion. (By the way, the only way to get reliable
> duplication on SSDs is to use physically separate devices.)

Interesting...

Have you any links/quotes/studies/specs for that please?


Does btrfs need to date-stamp each block/chunk to ensure that data is
rewritten before suffering flash memory bitrot?

Isn't the firmware in SSDs aware enough to rewrite any data that has gone unchanged for too long?


Regards,
Martin


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Feature Req: "mkfs.btrfs -d dup" option on single device
  2013-12-11 14:07             ` Martin
@ 2013-12-11 15:31               ` Imran Geriskovan
  2013-12-11 23:32               ` SSD data retention, was: " Chris Murphy
  1 sibling, 0 replies; 22+ messages in thread
From: Imran Geriskovan @ 2013-12-11 15:31 UTC (permalink / raw)
  To: Martin; +Cc: linux-btrfs

>> What's more (in relation to our long-term data integrity aim),
>> the order of magnitude of their unpowered data retention period is
>> 1 YEAR. (Read it as 6 months to 2-3 years.

> Does btrfs need to date-stamp each block/chunk to ensure that data is
> rewritten before suffering flash memory bitrot?
> Isn't the firmware in SSDs aware enough to rewrite any data that has gone unchanged for too long?

No. It is supposed to be handled by the firmware. That's why the drives
should be kept powered. It is not visible to the file system.
You can do a Google search for "ssd data retention".

There is no concrete info about it, but figures range from:
- 10 years retention for new devices, to
- 3-6 months for devices at their 'rated' usage.

There seems to be a consensus around 1 year.  And it seems
SSD vendors are close to the datacenters.

It's today's tech. In time we'll see whether it gets better or worse.

In the long run, we may have no choice but to put all our data
in the hands of beloved cloud lords. Hence the NSA. :)

Note that Sony has shut down its optical disc unit.

Regards,
Imran

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Feature Req: "mkfs.btrfs -d dup" option on single device
  2013-12-11  8:09               ` Hugo Mills
@ 2013-12-11 16:15                 ` Chris Murphy
  2013-12-11 17:46                 ` Duncan
  1 sibling, 0 replies; 22+ messages in thread
From: Chris Murphy @ 2013-12-11 16:15 UTC (permalink / raw)
  To: Hugo Mills; +Cc: Btrfs BTRFS


On Dec 11, 2013, at 1:09 AM, Hugo Mills <hugo@carfax.org.uk> wrote:

>   That documentation needs tweaking. You need --mixed/-M for larger
> filesystems than that. It's hard to say exactly where the optimal
> boundary is, but somewhere around 16 GiB seems to be the dividing
> point (8 GiB is in the "mostly going to cause you problems without it"
> area). 16 GiB is what we have on the wiki, I think.

Yes, man mkfs.btrfs also doesn't list dup as a possible option for -d.


Chris Murphy

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Feature Req: "mkfs.btrfs -d dup" option on single device
  2013-12-11  8:09               ` Hugo Mills
  2013-12-11 16:15                 ` Chris Murphy
@ 2013-12-11 17:46                 ` Duncan
  1 sibling, 0 replies; 22+ messages in thread
From: Duncan @ 2013-12-11 17:46 UTC (permalink / raw)
  To: linux-btrfs

Hugo Mills posted on Wed, 11 Dec 2013 08:09:02 +0000 as excerpted:

> On Tue, Dec 10, 2013 at 09:07:21PM -0700, Chris Murphy wrote:
>> 
>> On Dec 10, 2013, at 8:19 PM, Imran Geriskovan
>> <imran.geriskovan@gmail.com> wrote:
>> > 
>> > Now the question is, is it a good practice to use "-M" for large
>> > filesystems?
>> 
>> Uncertain. man mkfs.btrfs says "Mix data and metadata chunks together
>> for more efficient space utilization.  This feature incurs a
>> performance penalty in larger filesystems.  It is recommended for use
>> with filesystems of 1 GiB or smaller."
> 
> That documentation needs tweaking. You need --mixed/-M for larger
> filesystems than that. It's hard to say exactly where the optimal
> boundary is, but somewhere around 16 GiB seems to be the dividing point
> (8 GiB is in the "mostly going to cause you problems without it"
> area). 16 GiB is what we have on the wiki, I think.

I believe it also depends on the expected filesystem fill percentage and 
how that interacts with chunk sizes.  I posted some thoughts on this in 
another thread a couple weeks(?) ago.  Here's a rehash.

On large enough filesystems with enough unallocated space, data chunks 
are 1 GiB, while metadata chunks are 256 MiB, but I /think/ dup-mode 
means that'll double as they'll allocate in pairs.  For balance to do its 
thing and to avoid unexpected out-of-space errors, you need at least 
enough unallocated space to easily allocate one of each as the need 
arises (assuming file sizes significantly under a gig, so the chances of 
having to allocate two or more data chunks at once is reasonably low), 
which with normal separate data/metadata chunks, means 1.5 GiB 
unallocated, absolute minimum.  (2.5 gig if dup data also, 1.25 gig if 
single data and single metadata, or on each of two devices in raid1 data 
and metadata mode.)

Based on the above, it shouldn't be unobvious (hmm... double negative,
/should/ be /obvious/, but that's not /quite/ the nuance I want... the 
double negative stays) that with separate data/metadata, once the 
unallocated-free-space goes below the level required to allocate one of 
each, things get WAAYYY more complex and any latent corner-case bugs are 
far more likely to trigger.

And it's equally if not even more obvious (no negatives this time) that 
this 1.5 GiB "minimum safe reserve" space is going to be a MUCH larger 
share of say a 4 or 8 GiB filesystem, than it will of say a 32 GiB or 
larger filesystem.


However, I've had no issues with my root filesystems, 8 GiB each on two 
separate devices in btrfs raid1 (both data and metadata) mode, but I 
believe that's in large part because actual data usage according to btrfs 
fi df is 1.64 GiB (4 gig allocated), metadata 274 MiB (512 meg 
allocated).  There's plenty of space left unallocated, well more than the 
minimum-safe 1.25 gigs on each of two devices (1.25 gigs each not 1.5 
gigs each since there's only one metadata copy on each, not the default 
two of single-device dup mode).  And I'm on ssd with small filesystems, 
so a full balance takes about 2 minutes on that filesystem, not the hours 
to days often reported for multi-terabyte filesystems on spinning rust.  
So it's easy to full-balance any time allocated usage (as reported by 
btrfs filesystem show) starts to climb too far beyond actual used bytes 
within that allocation (as reported by btrfs filesystem df).
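
In concrete terms, the check I keep referring to is roughly (the mount point 
is just an example):

btrfs filesystem show          # per-device "used" = space allocated to chunks
btrfs filesystem df /mnt       # how much of that allocation is actually used
# if allocated runs way ahead of used, a balance reclaims the difference
btrfs balance start /mnt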

That means the filesystem stays healthy, with lots of unallocated freespace 
in reserve, should it be needed.  And even in the event something goes 
hog wild and uses all that space (logs, the usual culprits, are on a 
separate filesystem, as is /home, so it'd have to be a core system 
"something" going hog-wild!), at 8 gigs, I can easily do a temporary 
btrfs device add if I have to, to get the space necessary for a proper 
balance to do its thing.

I'm actually much more worried about my 24 gig, 21.5 gigs used, packages-
cache filesystem, tho it's only my cached gentoo packages tree, cached 
sources, etc, so it's easily restored direct from the net if it comes to 
that.  Before the rebalance I just did while writing this post, above, 
btrfs fi show reported it using 22.53 of 24.00 gigs (on each of the two 
devices in btrfs raid1), /waaayyy/ too close to that magic 1.25 GiB to be 
comfortable!  And after the balance it's still 21.5 gig used out of 24, 
so as it is, it's a DEFINITE candidate for an out-of-space error at some 
point.  I guess I need to clean up old sources and binpkgs, before I 
actually get that out-of-space and can't balance to fix it due to too 
much stale binpkg/sources cache.  I did recently update to kde 4.12 
branch live-git from 4.11-branch and I guess cleaning up the old 4.11 
binpkgs should release a few gigs.  That and a few other cleanups should 
bring it safely into line... for now... but the point is, that 24 gig 
filesystem both tends to run much closer to full and has a much more 
dramatic full/empty/full cycle than either my root or home filesystems, 
at 8 gig and 20 gig respectively.  It's the 24-gig where mixed-mode would 
really help; the others are fine as they are.

Meanwhile, I suspect the biggest downsides of mixed-mode are two-fold.  
The first is the size penalty of the implied dup-data-by-default of 
mixed-mode on a single-device filesystem.  Typically, data will run an order of 
magnitude larger than its metadata, two orders of magnitude if the files 
are large.  Duping all those extra data bytes can really hurt, space-
wise, compared to just duping metadata, and on a multi-terabyte single-
device filesystem, it can mean the difference between a terabyte of data 
and two terabytes of data.  No filesystem developer wants their 
filesystem to get the reputation of wasting terabytes of space, 
especially for the non-technical folks who can't tell what benefit (scrub 
actually has a second copy to recover from!) they're getting from it, so 
dup data simply isn't a practical default, regardless of whether it's in 
the form of separate dup-data chunks or mixed dup-data/metadata chunks.  
Yet when treated separately, the benefits of dup-metadata clearly 
outweigh the costs, and it'd be a shame to lose that to a single-mode 
mixed-mode default, so mixed-mode remains dup-by-default, even if that 
entails the extra cost of dup-data-by-default.

That's the first big negative of mixed-mode, the huge space cost of the 
implicit dup-data-by-default.

The second major downside of mixed mode surely relates to the performance 
cost of the actual IO of all that extra data, particularly on spinning 
rust.  First, actually writing all that extra data out, especially with 
the extra seeks now necessary to write it to two entirely separate 
chunks, altho that'll be somewhat mitigated by the fact that data and 
metadata will be combined so there's likely less seeking between data and 
metadata.  But on the read side, the shear volume of all that intertwined 
data and metadata must also mean far more seeking in all the directory 
data before the target file can even be accessed in the first place, and 
that's likely to exact a heavy read-side toll indeed, at least until the 
directory cache is warmed up.  Cold-boot times are going to suffer 
something fierce!

In terms of space, a rough calculation demonstrates a default-settings 
large file crossover near 4 GiB.  Consider a two-gigs data case.  With 
separate data/metadata, we'll have two gigs of data in single-mode, plus 
256 megs of metadata, dup mode so doubled to half a gig, so say 2.5 gig 
allocated (it'd actually be a bit more due to the system chunk, doubled 
due to dup).  As above a safe unallocated reserve is one of each, 
metadata again doubled due to dup, so 1.5 gig.  Usage is thus 2.5 gig 
allocated plus 1.5 gig reserved, about 4 gig.

The same two-gigs data in mixed mode ends up taking about 5 gig of 
filesystem space, two-gigs data doubles to four due to mixed-mode-dup.  
Metadata will be mixed in the same chunks, but won't fit in the same four 
gigs as that's all data, so that'll be say another possibly 128 megs 
duped to a quarter gig, or 256 megs duped to a half gig, depending on 
what's being allocated for mixed-mode chunk size.  Then another quarter 
or half gig must be reserved for allocation if needed, and there's the 
system allocation to consider too.  So we're looking at about 4.5 or 5 
gig.

More data means an even higher space cost for the duped mixed-mode data, 
while the separate-mode data/metadata reserved space requirement remains 
nearly constant.  At 4 gigs actual data, we're looking at nearing 9 gigs 
space cost for mixed, while separate will be only 4+.5+1.5, about 6 
gigs.  At 10 gigs actual data, we're looking at 21 gigs mixed-mode, 
perhaps 21.5 if additional mixed chunks need allocated for metadata, only 
10+.5+1.5, about 12 gigs separate mode, perhaps 12.5 or 13 if additional 
metadata chunks need allocated, so as you can see, the size cost for that 
duped data is getting dramatically worse relative to the default-single 
separate data mode.
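
As a back-of-envelope sketch of the above (the numbers just follow the rough 
assumptions in the preceding paragraphs, nothing more precise than that):

# separate chunks: data (single) + 0.5 GiB dup metadata + 1.5 GiB reserve
# mixed-dup chunks: data doubled + ~0.5 GiB duped metadata + ~0.5 GiB reserve
awk -v d=10 'BEGIN { printf "separate: ~%.1f GiB  mixed-dup: ~%.1f GiB\n", d+0.5+1.5, 2*d+0.5+0.5 }'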

Of course, if you'd have run separate dup data anyway had it been an 
option, that space cost zeroes out, and I suspect a lot of the 
performance cost does too.

Similarly but the other way for a dual-device raid1 both data/metadata, 
since from a single device perspective, it's effectively single-mode to 
each device separately.  Mixed-mode space-cost and I suspect much of the 
performance cost as well thus zeroes out as compared to separate-mode 
raid1-mode for both data/metadata.  Tho due to the metadata being spread 
more widely as it's mixed with the data, I suspect there's very likely 
still the read performance cost of those additional seeks necessary to 
gather the metadata to actually find the target file before it can be 
read, so cold-cache and thus cold-boot performance is still likely to 
suffer quite a bit.

Above 4 gig, it's really use-case dependent, depending particularly on 
single/dup/raid mode chosen and slow-ssd/spinning-rust/fast-ssd physical 
device, how much of the filesystem is expected to actually be used, and 
how actively it's going to be near-empty-to-near-full cycled, but 16 gigs 
would seem to be a reasonable general-case cut-over recommendation, 
perhaps 32 or 64 gigs for single-device single-mode or dual-device raid1 
mode on fast ssd, maybe 8 gigs for high free-space cases lacking a 
definite fill/empty/fill/empty pattern, or on particularly slow-seek 
spinning rust.

As for me, this post has helped convince me that I really should make 
that package-cache filesystem mixed-mode when I next mkfs.btrfs it.  It's 
20 gigs of data on a 24-gig filesystem, which wouldn't fit if I were 
going default data-single to default mixed-dup on a single device, but 
it's raid1 both data and metadata on dual fast ssd devices, so usage 
should stay about the same, while flexibility will go up, and as best I 
can predict, performance shouldn't suffer much either since I'm on fast 
ssds with what amounts to a zero seek time.

But I have little reason to change either rootfs or /home, 8 gigs about 
4.5 used, and 20 gigs about 14 used, respectively, from their current 
separate data/metadata.  Tho doing a fresh mkfs.btrfs on them and copying 
everything back from backup will still be useful, as it'll allow them to 
make use of newer features like the 16 KiB default node size and skinny 
metadata, that they're not using now.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Feature Req: "mkfs.btrfs -d dup" option on single device
  2013-12-11 13:19             ` Imran Geriskovan
@ 2013-12-11 18:27               ` Duncan
  2013-12-12 15:57                 ` Chris Mason
  0 siblings, 1 reply; 22+ messages in thread
From: Duncan @ 2013-12-11 18:27 UTC (permalink / raw)
  To: linux-btrfs

Imran Geriskovan posted on Wed, 11 Dec 2013 15:19:29 +0200 as excerpted:

> Now, there is one open issue:
> In its current form "-d dup" requires "-M". Is that a constraint of the
> design, or an arbitrary/temporary one? What will the situation be if
> there are tunable duplicates?

I believe I answered that, albeit somewhat indirectly, when I explained 
that AFAIK, the fact that -M (mixed mode) has the effect of allowing -d 
dup mode is an accident.  Mixed mode was introduced to fix the very real 
problem of small btrfs filesystems tending to run out of either data or 
metadata space very quickly, while having plenty of the other resource 
still available, due to inappropriate separate mode allocations.  And it 
fixed that problem rather well, IMO and experience! =:^)

Thus mixed-mode wasn't designed to enable duped data at all, but rather 
to solve a very different problem (which it did very well), and I'm not 
sure the devs even realized that the dup-data it enabled as a side effect 
of forcing data and metadata to the same dup-mode, was a feature people 
might actually want on its own, until after the fact.

So I doubt very much it was a constraint of the design.  If it was 
deliberate, I expect they'd have enabled data=dup mode directly.  Rather, 
it was purely an accident: they fixed the unbalanced small-filesystem 
allocation issue by enabling a mixed mode that, as a side effect of 
combining data and metadata into the same blocks, also happened to allow 
data=dup.

Actually, it may be that they're only with this thread seeing people 
actually wanting the data=dup option on its own, and why they might want 
it.  Tho it's equally possible they realized that some time ago, shortly 
after accidentally enabling it via mixed-mode, and have it on their list 
since then but have simply been too busy fixing bugs and working on 
features such as the still unfinished raid5/6 code to get to this.  We'll 
only know if they post, but regardless of whether they saw it before or 
not, it'd be pretty hard to avoid seeing it with what this thread has 
blossomed into, so I'm sure they see it now! =:^)

> And more:
> Is "-M" good for everyday usage on large fs for efficient packing?
> What's the penalty? Can it be cured? If so, why not make it the default?

I believe I addressed that in the post I just sent, which took me some 
time to compose as I kept ending up way into the weeds on other topics, 
and I ended up deleting multiple whole paragraphs in order to rewrite 
them hopefully better, several times.

In brief, I believe the biggest penalties won't apply in your case, since 
they're related to the dup-data effect, and that's actually what you're 
interested in, so they'd apply or not apply regardless of mixed-mode.

But I do expect there are two penalties in general.  The first is the
raw cost of duplicating large quantities of data by default (as opposed
to duplicating only the metadata, which is generally an order of
magnitude smaller).  The second is what that does to IO performance,
particularly uncached directory/metadata reads and the resulting seeks
needed just to find a file before it can be read in the first place.
That's going to absolutely murder cold-boot times on spinning rust, to
give one highly performance-critical example that has been the focus of
numerous articles and can-I-make-it-boot-faster-than-N-seconds projects
over the years.  Murder boot times the way mixed mode very well might on
spinning rust, and your pet development filesystem will very likely go
over like a lead balloon!  So it's little wonder the devs discourage
using mixed mode for anything but the smallest filesystems, where it's
presented as a workaround for an otherwise very difficult problem.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: SSD data retention, was: "mkfs.btrfs -d dup" option on single device
  2013-12-11 14:07             ` Martin
  2013-12-11 15:31               ` Imran Geriskovan
@ 2013-12-11 23:32               ` Chris Murphy
  1 sibling, 0 replies; 22+ messages in thread
From: Chris Murphy @ 2013-12-11 23:32 UTC (permalink / raw)
  To: Martin; +Cc: linux-btrfs


On Dec 11, 2013, at 7:07 AM, Martin <m_btrfs@ml1.co.uk> wrote:

> On 11/12/13 03:19, Imran Geriskovan wrote:
> 
> SSDs:
> 
>> What's more (in relation to our long-term data integrity aim),
>> the order of magnitude for their unpowered data retention period is
>> 1 YEAR. (Read it as 6 months to 2-3 years. While powered they
>> refresh/shuffle the blocks.) This makes SSDs
>> unsuitable for mid-to-long term consumer storage. Hence they are
>> out of this discussion. (By the way, the only way to get reliable
>> duplication on SSDs is to use physically separate devices.)
> 
> Interesting...
> 
> Have you any links/quotes/studies/specs for that please?


This is an eye opener:
http://techreport.com/review/25681/the-ssd-endurance-experiment-testing-data-retention-at-300tb

Check page 2's "Data Retention" section, which only glosses over unpowered data retention. But there's a test where a 200GB hashed file on one of the SSDs fails its checksum. It fails a 2nd time, but with a different checksum. And again, with yet another checksum. Then the drive is powered off for five days, and the checksum passes. Isn't that adorable?

What the article doesn't say, and it's rather important, is whether the drive reported a read error to the SATA driver. If not, then ECC failed to do its job, differently, three times. If there was a read error, that's still not good, but it's distinctly better than the former.
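
(The kind of retention check the article describes is easy enough to reproduce; a rough sketch, with made-up file names, would be:

  sha256sum /mnt/ssd/testfile.bin > testfile.sha256   # record a reference hash
  # ...leave the drive unpowered for the test period, reattach it, then:
  sha256sum -c testfile.sha256                        # prints OK or FAILED

Watching dmesg during the re-read would also answer the question above, i.e. whether a mismatch comes with a reported read error or slips through silently.)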

So the issue with unpowered data retention is that the time varies a ton based on wear. The more blocks have been written to or erased, the lower their retention time without power. I think 2-3 years right now is really optimistic, seeing as retention falls off sharply once a drive has accumulated significant wear.


> Does btrfs need to date-stamp each block/chunk to ensure that data is
> rewritten before suffering flash memory bitrot?
> 
> Is not the firmware in SSDs aware to rewrite any too-long unchanged data?

My understanding, which may be wrong, is that the drive only needs power. The data doesn't need to be re-written.


Chris Murphy

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Feature Req: "mkfs.btrfs -d dup" option on single device
  2013-12-11 18:27               ` Duncan
@ 2013-12-12 15:57                 ` Chris Mason
  2013-12-12 17:58                   ` David Sterba
  2013-12-17 18:37                   ` Imran Geriskovan
  0 siblings, 2 replies; 22+ messages in thread
From: Chris Mason @ 2013-12-12 15:57 UTC (permalink / raw)
  To: Duncan, linux-btrfs

Quoting Duncan (2013-12-11 13:27:53)
> Imran Geriskovan posted on Wed, 11 Dec 2013 15:19:29 +0200 as excerpted:
> 
> > Now, there is one open issue:
> > In its current form "-d dup" interferes with "-M". Is that a constraint
> > of the design, or an arbitrary/temporary one? What will the situation
> > be if there are tunable duplicate counts?
> 
> I believe I answered that, albeit somewhat indirectly, when I explained 
> that AFAIK, the fact that -M (mixed mode) has the effect of allowing -d 
> dup mode is an accident.  Mixed mode was introduced to fix the very real 
> problem of small btrfs filesystems tending to run out of either data or 
> metadata space very quickly, while having all sorts of the other resource 
> still available, due to inappropriate separate mode allocations.  And it 
> fixed that problem rather well, IMO and experience! =:^)
> 
> Thus mixed-mode wasn't designed to enable duped data at all, but rather 
> to solve a very different problem (which it did very well), and I'm not 
> sure the devs even realized that the dup-data it enabled as a side effect 
> of forcing data and metadata to the same dup-mode, was a feature people 
> might actually want on its own, until after the fact.
> 
> So I doubt very much it was a constraint of the design.  If it had been
> deliberate, I expect they'd have enabled data=dup mode directly.  Rather,
> it was purely an accident: they fixed the unbalanced small-filesystem
> allocation issue by enabling a mixed mode that, as a side effect of
> combining data and metadata into the same block groups, also happened to
> allow data=dup.

For me anyway, data=dup in mixed mode is definitely an accident ;)

I personally think data dup is a false sense of security, but drives
have gotten so huge that it may actually make sense in a few
configurations.

Someone asks for it roughly once a year, so it probably isn't a horrible
idea.

The biggest problem with mixed mode is just that it isn't very common.
You'll end up finding corners that others do not.  Also mixed mode
forces your metadata block size down to 4K, which does increase
fragmentation over time.  The new default of 16K is overall much faster.
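
(A sketch, for anyone wanting to pick the node size explicitly on a
non-mixed filesystem -- assuming a reasonably current mkfs.btrfs, where
-n/--nodesize sets the metadata node size, and /dev/sdXN is a
placeholder:

  mkfs.btrfs -n 16384 -m dup -d single /dev/sdXN

With --mixed the node size is tied to the sector size, so that knob
isn't available there.)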

-chris

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Feature Req: "mkfs.btrfs -d dup" option on single device
  2013-12-12 15:57                 ` Chris Mason
@ 2013-12-12 17:58                   ` David Sterba
  2013-12-13  9:33                     ` Duncan
  2013-12-17 18:37                   ` Imran Geriskovan
  1 sibling, 1 reply; 22+ messages in thread
From: David Sterba @ 2013-12-12 17:58 UTC (permalink / raw)
  To: Chris Mason; +Cc: Duncan, linux-btrfs

On Thu, Dec 12, 2013 at 10:57:33AM -0500, Chris Mason wrote:
> Quoting Duncan (2013-12-11 13:27:53)
> > Imran Geriskovan posted on Wed, 11 Dec 2013 15:19:29 +0200 as excerpted:
> > 
> > > Now, there is one open issue:
> > > In its current form "-d dup" interferes with "-M". Is that a constraint
> > > of the design, or an arbitrary/temporary one? What will the situation
> > > be if there are tunable duplicate counts?
> > 
> > I believe I answered that, albeit somewhat indirectly, when I explained 
> > that AFAIK, the fact that -M (mixed mode) has the effect of allowing -d 
> > dup mode is an accident.  Mixed mode was introduced to fix the very real 
> > problem of small btrfs filesystems tending to run out of either data or 
> > metadata space very quickly, while having all sorts of the other resource 
> > still available, due to inappropriate separate mode allocations.  And it 
> > fixed that problem rather well, IMO and experience! =:^)
> > 
> > Thus mixed-mode wasn't designed to enable duped data at all, but rather 
> > to solve a very different problem (which it did very well), and I'm not 
> > sure the devs even realized that the dup-data it enabled as a side effect 
> > of forcing data and metadata to the same dup-mode, was a feature people 
> > might actually want on its own, until after the fact.
> > 
> > So I doubt very much it was a constraint of the design.  If it had been
> > deliberate, I expect they'd have enabled data=dup mode directly.  Rather,
> > it was purely an accident: they fixed the unbalanced small-filesystem
> > allocation issue by enabling a mixed mode that, as a side effect of
> > combining data and metadata into the same block groups, also happened to
> > allow data=dup.
> 
> For me anyway, data=dup in mixed mode is definitely an accident ;)

I asked for data=dup to be allowed in mixed mode when Ilya implemented
the validation of balance filters. That's a convenient way to get
duplicated data on a single device.
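
(Roughly, the balance-filter route converts the profiles of an existing,
mounted filesystem in place -- a sketch, assuming a kernel and progs new
enough to accept dup as a conversion target for data (newer versions do;
at the time of this thread it may have required mixed mode):

  btrfs balance start -dconvert=dup -mconvert=dup /mnt

Existing chunks get rewritten with the dup profile as the balance runs.)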

> I personally think data dup is a false sense of security, but drives
> have gotten so huge that it may actually make sense in a few
> configurations.

It's not perfect, yeah.

> Someone asks for it roughly once a year, so it probably isn't a horrible
> idea.
> 
> The biggest problem with mixed mode is just that it isn't very common.
> You'll end up finding corners that others do not.  Also mixed mode
> forces your metadata block size down to 4K, which does increase
> fragmentation over time.  The new default of 16K is overall much faster.

As far as I remember, I've been testing --mixed mode with various other
raid profile types. Some bugs popped up; I reported them and Josef fixed
them.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Feature Req: "mkfs.btrfs -d dup" option on single device
  2013-12-12 17:58                   ` David Sterba
@ 2013-12-13  9:33                     ` Duncan
  0 siblings, 0 replies; 22+ messages in thread
From: Duncan @ 2013-12-13  9:33 UTC (permalink / raw)
  To: linux-btrfs

David Sterba posted on Thu, 12 Dec 2013 18:58:16 +0100 as excerpted:

> I've been testing --mixed mode with various other raid profiles types as
> far as I remember. Some bugs popped up, reported and Josef fixed them.

FWIW, I'm running a mixed-mode btrfs raid1 here on my log partition and
haven't seen any issues, altho that's sub-GiB (640 MiB), so anything
that would only show up at larger sizes won't have triggered here.
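
(For reference, a small mixed-mode raid1 filesystem like that would be
created with something along these lines -- device names being
placeholders, of course:

  mkfs.btrfs -M -m raid1 -d raid1 /dev/sdXN /dev/sdYN

with the usual caveat that mixed mode is only recommended for small
filesystems.)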

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Feature Req: "mkfs.btrfs -d dup" option on single device
  2013-12-12 15:57                 ` Chris Mason
  2013-12-12 17:58                   ` David Sterba
@ 2013-12-17 18:37                   ` Imran Geriskovan
  1 sibling, 0 replies; 22+ messages in thread
From: Imran Geriskovan @ 2013-12-17 18:37 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

On 12/12/13, Chris Mason <clm@fb.com> wrote:
> For me anyway, data=dup in mixed mode is definitely an accident ;)
> I personally think data dup is a false sense of security, but drives
> have gotten so huge that it may actually make sense in a few
> configurations.

Sure, it's not about any protection against failure of the device itself.

It's about being able to recover from bit-rot that can creep into your
backups and is only detected when you actually need the file, perhaps
20-30 backup generations later, which is too late. (Who keeps that much
incremental archive, and who regularly reads the backup logs of millions
of files?)

> Someone asks for it roughly once a year, so it probably isn't a horrible
> idea.
> -chris

Today I brought up an old 2 GB Seagate from the basement.
Literally, it has rusted, so it deserves the title of
"Spinning Rust" for real. I had little hope that it would work,
but out of curiosity I plugged it into a USB-IDE box.

It spun up and, wow, it showed up among the devices.
It had two swap partitions and an ext2 partition. I remembered that it
was one of the disks used for Linux installations more than
10 years ago. I mounted it. Most of the files date back to 2001-07.

They are more than 12 years old and they seem to be intact,
with just one inode size mismatch. (See the fsck output below.)

If BTRFS (and -d dup :) ) had existed at the time, I would now
perform a scrub and report the outcome here. Hence,
'Digital Archeology' can surely benefit from Btrfs. :)
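
(For the record, the check I have in mind is nothing more than the
following, run on a mounted btrfs filesystem, /mnt/olddisk being just a
placeholder:

  btrfs scrub start /mnt/olddisk    # verify all checksums; with -d dup,
                                    # errors are repaired from the copy
  btrfs scrub status /mnt/olddisk   # report errors found/corrected

Of course, that presumes the filesystem was btrfs to begin with.)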

PS: And regarding the "SSD data retention" debate, this can be an
interesting benchmark for a device which was kept in an unfavorable
environment.

Regards,
Imran


FSCK output:

fsck from util-linux 2.20.1
e2fsck 1.42.8 (20-Jun-2013)
/dev/sdb3 has gone 4209 days without being checked, check forced.
Pass 1: Checking inodes, blocks, and sizes
Special (device/socket/fifo) inode 82669 has non-zero size.  Fix<y>? yes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/sdb3: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sdb3: 41930/226688 files (1.0% non-contiguous), 200558/453096 blocks

^ permalink raw reply	[flat|nested] 22+ messages in thread

Thread overview: 22+ messages
2013-12-10 20:31 Feature Req: "mkfs.btrfs -d dup" option on single device Imran Geriskovan
2013-12-10 22:41 ` Chris Murphy
2013-12-10 23:33   ` Imran Geriskovan
2013-12-10 23:40     ` Chris Murphy
     [not found]       ` <CAK5rZE6DVC5kYAU68oCjjzGPS4B=nRhOzATGM-5=m1_bW4GG6g@mail.gmail.com>
2013-12-11  0:17         ` Fwd: " Imran Geriskovan
2013-12-11  0:33         ` Chris Murphy
2013-12-11  3:19           ` Imran Geriskovan
2013-12-11  4:07             ` Chris Murphy
2013-12-11  8:09               ` Hugo Mills
2013-12-11 16:15                 ` Chris Murphy
2013-12-11 17:46                 ` Duncan
2013-12-11 14:07             ` Martin
2013-12-11 15:31               ` Imran Geriskovan
2013-12-11 23:32               ` SSD data retention, was: " Chris Murphy
2013-12-11  7:39           ` Feature Req: " Duncan
2013-12-11 10:56           ` Duncan
2013-12-11 13:19             ` Imran Geriskovan
2013-12-11 18:27               ` Duncan
2013-12-12 15:57                 ` Chris Mason
2013-12-12 17:58                   ` David Sterba
2013-12-13  9:33                     ` Duncan
2013-12-17 18:37                   ` Imran Geriskovan
