* checksum error in metadata node - best way to move root fs to new drive?
@ 2016-08-10  3:27 Dave T
  2016-08-10  6:27 ` Duncan
  2016-08-10 21:15 ` Chris Murphy
  0 siblings, 2 replies; 28+ messages in thread
From: Dave T @ 2016-08-10  3:27 UTC (permalink / raw)
  To: linux-btrfs

btrfs scrub returned with uncorrectable errors. Searching in dmesg
returns the following information:

BTRFS warning (device dm-0): checksum error at logical NNNNN on
/dev/mapper/[crypto] sector: yyyyy metadata node (level 2) in tree 250

it also says:

unable to fixup (regular) error at logical NNNNNN on /dev/mapper/[crypto]


I assume I have a bad block device. Does that seem correct? The
important data is backed up.

However, it would save me a lot of time reinstalling the operating
system and setting up my work environment if I can copy this root
filesystem to another storage device.

Can I do that, considering the errors I have mentioned?? With the
uncorrectable error being in a metadata node, what (if anything) does
that imply about restoring from this drive?

If I can copy this entire root filesystem, what is the best way to do
it? The btrfs restore tool? cp? rsync? Some cloning tool? Other
options?

If I use the btrfs restore tool, should I use options x, m and S? In
particular I wonder exactly what the S option does. If I leave S out,
are all symlinks ignored?

I'm trying to save time and clone this so that I get the operating
system and all my tweaks / configurations back. As I said, the really
important data is separately backed up.

I appreciate all suggestions.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: checksum error in metadata node - best way to move root fs to new drive?
  2016-08-10  3:27 checksum error in metadata node - best way to move root fs to new drive? Dave T
@ 2016-08-10  6:27 ` Duncan
  2016-08-10 19:46   ` Austin S. Hemmelgarn
  2016-08-10 21:21   ` Chris Murphy
  2016-08-10 21:15 ` Chris Murphy
  1 sibling, 2 replies; 28+ messages in thread
From: Duncan @ 2016-08-10  6:27 UTC (permalink / raw)
  To: linux-btrfs

Dave T posted on Tue, 09 Aug 2016 23:27:56 -0400 as excerpted:

> btrfs scrub returned with uncorrectable errors. Searching in dmesg
> returns the following information:
> 
> BTRFS warning (device dm-0): checksum error at logical NNNNN on
> /dev/mapper/[crypto] sector: yyyyy metadata node (level 2) in tree 250
> 
> it also says:
> 
> unable to fixup (regular) error at logical NNNNNN on
> /dev/mapper/[crypto]
> 
> 
> I assume I have a bad block device. Does that seem correct? The
> important data is backed up.
> 
> However, it would save me a lot of time reinstalling the operating
> system and setting up my work environment if I can copy this root
> filesystem to another storage device.
> 
> Can I do that, considering the errors I have mentioned?? With the
> uncorrectable error being in a metadata node, what (if anything) does
> that imply about restoring from this drive?

Well, given that I don't see anyone more qualified than I am posting (I'm a 
simple btrfs user and list regular, tho not a dmcrypt user and definitely 
not a btrfs dev), I'll try to help, but...

Do you know what data and metadata replication modes you were using?  
Scrub detects checksum errors, and for raid1 mode on multi-device (but I 
guess you were single device) and dup mode on single device, it will try 
the other copy and use it if the checksum passes there, repairing the bad 
copy as well.

But until recently dup mode data on single device was impossible, so I 
doubt you were using that, and while dup mode metadata was the normal 
default, on ssd that changes to single mode as well.

Which means if you were using ssd defaults, you got single mode for both 
data and metadata, and scrub can detect but not correct checksum errors.

That doesn't directly answer your question, but it does explain why/that 
you couldn't /expect/ scrub to fix checksum problems, only detect them, 
if both data and metadata are single mode.
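
(If you're not sure which profiles the filesystem was created with, a 
quick way to check -- assuming the filesystem still mounts and you have 
reasonably current btrfs-progs -- is:

  btrfs filesystem df /mountpoint      # shows e.g. "Data, single" / "Metadata, DUP"
  btrfs filesystem usage /mountpoint   # same info plus per-device allocation

The profile names in that output decide whether scrub has a second copy 
to repair from.)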

Meanwhile, in a different post you asked about btrfs on dmcrypt.  I'm not 
aware of any direct btrfs-on-dmcrypt specific bugs (tho I'm just a btrfs 
user and list regular, not a dev, so could have missed something), but 
certainly, the dmcrypt layer doesn't simplify things.  There was a guy 
here, Marc MERLIN (worked for Google I believe and was on the road 
frequently), who was using btrfs on dmcrypt for his laptop and various 
btrfs on his servers as well -- he wrote some of the raid56 mode stuff on 
the wiki based on his own experiments with it.  But I haven't seen him 
around recently.  I'd suggest he'd be the guy to talk to about btrfs on 
dmcrypt if you can get in contact with him, as he seemed to have more 
experience with it than anyone else around here.  But like I said I 
haven't seen him around recently...

Put it this way.  If it were my data on the line, I'd either (1) use 
another filesystem on top of dmcrypt, if I really wanted/needed the 
crypted layer, or (2) do without the crypted layer, or (3) use btrfs but 
be extra vigilant with backups.  This since while I know of no specific 
bugs in btrfs-on-dmcrypt case, I don't particularly trust it either, and 
Marc MERLIN's posted troubles with the combo were enough to have me 
avoiding it if possible, and being extra careful with backups if not.

> If I can copy this entire root filesystem, what is the best way to do
> it? The btrfs restore tool? cp? rsync? Some cloning tool? Other options?

It depends on if the filesystem is mountable and if so, how much can be 
retrieved without error, the latter of which depends on the extent of 
that metadata damage, since damaged metadata will likely take out 
multiple files, and depending on what level of the tree the damage was 
on, it could take out only a few files, or most of the filesystem!

If you can mount and the damage appears to be limited, I'd try mounting 
read-only and copying what I could off, using conventional methods.  That 
way you get checksum protection, which should help assure that anything 
successfully copied isn't corrupted, because btrfs will return an error 
on a checksum failure and that file simply won't copy successfully.
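
Roughly something like the following, with the device and target names 
only placeholders for whatever your setup actually uses:

  mkdir -p /mnt/old /mnt/new
  mount -o ro /dev/mapper/oldcrypt /mnt/old   # read-only: nothing is written to the damaged fs
  mount /dev/mapper/newcrypt /mnt/new         # the replacement filesystem, already created
  rsync -aAXH /mnt/old/ /mnt/new/             # -A/-X/-H keep ACLs, xattrs and hard links

Files that hit a checksum error fail with an I/O error; rsync reports 
them and moves on, so at the end you have a list of exactly what didn't 
make it across.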

If it won't mount or it will but the damage appears to be extensive, I'd 
suggest using restore.  It's read-only in terms of the filesystem it's 
restoring from, so shouldn't cause further damage -- unless the device is 
actively decaying as you use it, in which case the first thing I'd try to 
do is image it to something else so the damage isn't getting worse as you 
work with it.

But AFAIK restore doesn't give you the checksum protection, so anything 
restored that way /could/ be corrupt (tho it's worth noting that ordinary 
filesystems don't do checksum protection anyway, so it's important not to 
consider the file any more damaged just because it wasn't checksum 
protected than it would be if you simply retrieved it from say an ext4 
filesystem and didn't have some other method to verify the file).

Altho... working on dmcrypt, I suppose it's likely that anything that's 
corrupted turns up entirely scrambled and useless anyway -- instead of 
retrieving, say, a video file with a few dropouts, as might be the case 
on unencrypted storage, you may end up with a totally scrambled and 
useless file, or at least a scrambled 4K block within it.

> If I use the btrfs restore tool, should I use options x, m and S? In
> particular I wonder exactly what the S option does. If I leave S out,
> are all symlinks ignored?

Symlinks are not restored without -S, correct.  That and -m are both 
relatively new restore options -- back when I first used restore you 
simply didn't get that back.

If it's primarily just data files and you don't really care about 
ownership/permissions or date metadata, you can leave the -m off to 
simplify the process slightly.  In that case, the files will be written 
just as any other new file would be written, as the user (root) the app 
is running as, subject to the current umask.  Else use the -m and restore 
will try to restore ownership/permissions/dates metadata as well.

Similarly, you may or may not need -x for the extended attributes.  
Unless you're using selinux and its security attributes, or capabilities to 
avoid running as superuser (and those both apply primarily to 
executables), chances are fairly good that unless you specifically know 
you need extended attributes restored, you don't, and can skip that 
option.
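
For reference, a restore invocation with all three options might look 
something like this (typically run against the unmounted device; the 
target directory is just an example):

  btrfs restore -v -x -m -S /dev/mapper/oldcrypt /mnt/rescue
  # -x extended attributes, -m ownership/permissions/timestamps, -S symlinks
  # add -D for a dry run that only lists what would be restored

...dropping -x and/or -m, as above, if you don't need that metadata.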

> I'm trying to save time and clone this so that I get the operating
> system and all my tweaks / configurations back. As I said, the really
> important data is separately backed up.

Good. =:^)

Sounds about like me.  I do periodic backups, but have run restore a 
couple times when a filesystem wouldn't mount, in order to get back as 
much of the delta between the last backup and current as possible.  Of 
course I know not doing more frequent backups is a calculated risk and I 
was prepared to have to redo anything changed since the backup if 
necessary, but it's nice to have a tool like btrfs restore that can make 
it unnecessary under certain conditions where it otherwise would be. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: checksum error in metadata node - best way to move root fs to new drive?
  2016-08-10  6:27 ` Duncan
@ 2016-08-10 19:46   ` Austin S. Hemmelgarn
  2016-08-10 21:21   ` Chris Murphy
  1 sibling, 0 replies; 28+ messages in thread
From: Austin S. Hemmelgarn @ 2016-08-10 19:46 UTC (permalink / raw)
  To: linux-btrfs

On 2016-08-10 02:27, Duncan wrote:
> Dave T posted on Tue, 09 Aug 2016 23:27:56 -0400 as excerpted:
>
>> btrfs scrub returned with uncorrectable errors. Searching in dmesg
>> returns the following information:
>>
>> BTRFS warning (device dm-0): checksum error at logical NNNNN on
>> /dev/mapper/[crypto] sector: yyyyy metadata node (level 2) in tree 250
>>
>> it also says:
>>
>> unable to fixup (regular) error at logical NNNNNN on
>> /dev/mapper/[crypto]
>>
>>
>> I assume I have a bad block device. Does that seem correct? The
>> important data is backed up.
>>
>> However, it would save me a lot of time reinstalling the operating
>> system and setting up my work environment if I can copy this root
>> filesystem to another storage device.
>>
>> Can I do that, considering the errors I have mentioned?? With the
>> uncorrectable error being in a metadata node, what (if anything) does
>> that imply about restoring from this drive?
>
> Well, given that I don't see any other people more qualified than I, as a
> simple btrfs user and list regular, tho not a dmcrypt user and definitely
> not a btrfs dev, posting, I'll try to help, but...
I probably would have replied, if I had seen the e-mail before now. 
GMail apparently really hates me recently, as I keep getting things 
hours to days after other people and regularly out of order...

As usual though, you seem to have already covered everything important 
pretty well, I've only got a few comments to add below.
>
> Do you know what data and metadata replication modes you were using?
> Scrub detects checksum errors, and for raid1 mode on multi-device (but I
> guess you were single device) and dup mode on single device, it will try
> the other copy and use it if the checksum passes there, repairing the bad
> copy as well.
>
> But until recently dup mode data on single device was impossible, so I
> doubt you were using that, and while dup mode metadata was the normal
> default, on ssd that changes to single mode as well.
>
> Which means if you were using ssd defaults, you got single mode for both
> data and metadata, and scrub can detect but not correct checksum errors.
>
> That doesn't directly answer your question, but it does explain why/that
> you couldn't /expect/ scrub to fix checksum problems, only detect them,
> if both data and metadata are single mode.
>
> Meanwhile, in a different post you asked about btrfs on dmcrypt.  I'm not
> aware of any direct btrfs-on-dmcrypt specific bugs (tho I'm just a btrfs
> user and list regular, not a dev, so could have missed something), but
> certainly, the dmcrypt layer doesn't simplify things.  There was a guy
> here, Marc MERLIN, worked for Google I believe and was on the road
> frequently, that was using btrfs on dmcrypt for his laptop and various
> btrfs on his servers as well -- he wrote some of the raid56 mode stuff on
> the wiki based on his own experiments with it.  But I haven't seen him
> around recently.  I'd suggest he'd be the guy to talk to about btrfs on
> dmcrypt if you can get in contact with him, as he seemed to have more
> experience with it than anyone else around here.  But like I said I
> haven't seen him around recently...
>
> Put it this way.  If it were my data on the line, I'd either (1) use
> another filesystem on top of dmcrypt, if I really wanted/needed the
> crypted layer, or (2) do without the crypted layer, or (3) use btrfs but
> be extra vigilant with backups.  This since while I know of no specific
> bugs in btrfs-on-dmcrypt case, I don't particularly trust it either, and
> Marc MERLIN's posted troubles with the combo were enough to have me
> avoiding it if possible, and being extra careful with backups if not.
As far as dm-crypt goes, it looks like BTRFS is stable on top in the 
configuration I use (aes-xts-plain64 with a long key using plain 
dm-crypt instead of LUKS).  I have heard rumors of issues when using 
LUKS without hardware acceleration, but I've never seen any conclusive 
proof, and what little I've heard sounds more like it was just race 
conditions elsewhere causing the issues.
>
>> If I can copy this entire root filesystem, what is the best way to do
>> it? The btrfs restore tool? cp? rsync? Some cloning tool? Other options?
>
> It depends on if the filesystem is mountable and if so, how much can be
> retrieved without error, the latter of which depends on the extent of
> that metadata damage, since damaged metadata will likely take out
> multiple files, and depending on what level of the tree the damage was
> on, it could take out only a few files, or most of the filesystem!
>
> If you can mount and the damage appears to be limited, I'd try mounting
> read-only and copying what I could off, using conventional methods.  That
> way you get checksum protection, which should help assure that anything
> successfully copied isn't corrupted, because btrfs will error out if
> there's checksum errors and it won't copy successfully.
>
> If it won't mount or it will but the damage appears to be extensive, I'd
> suggest using restore.  It's read-only in terms of the filesystem it's
> restoring from, so shouldn't cause further damage -- unless the device is
> actively decaying as you use it, in which case the first thing I'd try to
> do is image it to something else so the damage isn't getting worse as you
> work with it.
>
> But AFAIK restore doesn't give you the checksum protection, so anything
> restored that way /could/ be corrupt (tho it's worth noting that ordinary
> filesystems don't do checksum protection anyway, so it's important not to
> consider the file any more damaged just because it wasn't checksum
> protected than it would be if you simply retrieved it from say an ext4
> filesystem and didn't have some other method to verify the file).
>
> Altho... working on dmcrypt, I suppose it's likely that anything that's
> corrupted turns up entirely scrambled and useless anyway -- you may not
> be able to retrieve for example a video file with some dropouts as may be
> the case on unencrypted storage, but have a totally scrambled and useless
> file, or at least that file block (4K), instead.
This may or may not be the case, it really depends on how dm-crypt is 
set up, and a bunch of other factors.  The chance of this happening is 
higher with dm-crypt, but it's still not a certainty.
>
>> If I use the btrfs restore tool, should I use options x, m and S? In
>> particular I wonder exactly what the S option does. If I leave S out,
>> are all symlinks ignored?
>
> Symlinks are not restored without -S, correct.  That and -m are both
> relatively new restore options -- back when I first used restore you
> simply didn't get that back.
>
> If it's primarily just data files and you don't really care about
> ownership/permissions or date metadata, you can leave the -m off to
> simplify the process slightly.  In that case, the files will be written
> just as any other new file would be written, as the user (root) the app
> is running as, subject to the current umask.  Else use the -m and restore
> will try to restore ownership/permissions/dates metadata as well.
>
> Similarly, you may or may not need -x for the extended attributes.
> Unless you're using selinux and its security attributes, or capabilities to
> avoid running as superuser (and those both apply primarily to
> executables), chances are fairly good that unless you specifically know
> you need extended attributes restored, you don't, and can skip that
> option.
There are a few other cases where they are important, but most of them 
are big data-center type things.  The big one I can think of off the top 
of my head is when using GlusterFS on top of BTRFS, as Gluster stores 
synchronization info in xattrs.  I'm pretty certain Ceph does too.  In 
general though, if it's just a workstation, you probably don't need 
xattrs unless you use a security module (like SELinux, IMA, or EVM), 
file capabilities (ping almost certainly does on your system, but I 
doubt anything else does, and ping won't break without them), or are 
using ACL's (or Samba, it stores Windows style ACE's in xattrs, but it 
doesn't do so by default, and setting that up right is complicated).

If you can afford to wait a bit longer, it's probably better to use -x, 
because most of the things that break in the face of missing xattrs tend 
to break rather spectacularly.
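
A quick way to see whether there's actually anything to preserve, 
assuming the attr and libcap userspace tools are installed (the paths 
are just examples):

  getfattr -R -d -m - /mnt/old/etc 2>/dev/null | head   # dump any xattrs found under a tree
  getcap /mnt/old/usr/bin/ping                          # file capabilities, the usual suspect

If both come back empty for the trees you care about, skipping -x loses 
nothing.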
>
>> I'm trying to save time and clone this so that I get the operating
>> system and all my tweaks / configurations back. As I said, the really
>> important data is separately backed up.
>
> Good. =:^)
>
> Sounds about like me.  I do periodic backups, but have run restore a
> couple times when a filesystem wouldn't mount, in order to get back as
> much of the delta between the last backup and current as possible.  Of
> course I know not doing more frequent backups is a calculated risk and I
> was prepared to have to redo anything changed since the backup if
> necessary, but it's nice to have a tool like btrfs restore that can make
> it unnecessary under certain conditions where it otherwise would be. =:^)
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: checksum error in metadata node - best way to move root fs to new drive?
  2016-08-10  3:27 checksum error in metadata node - best way to move root fs to new drive? Dave T
  2016-08-10  6:27 ` Duncan
@ 2016-08-10 21:15 ` Chris Murphy
  2016-08-10 22:50   ` Dave T
  1 sibling, 1 reply; 28+ messages in thread
From: Chris Murphy @ 2016-08-10 21:15 UTC (permalink / raw)
  To: Dave T; +Cc: Btrfs BTRFS

On Tue, Aug 9, 2016 at 9:27 PM, Dave T <davestechshop@gmail.com> wrote:
> btrfs scrub returned with uncorrectable errors. Searching in dmesg
> returns the following information:
>
> BTRFS warning (device dm-0): checksum error at logical NNNNN on
> /dev/mapper/[crypto] sector: yyyyy metadata node (level 2) in tree 250
>
> it also says:
>
> unable to fixup (regular) error at logical NNNNNN on /dev/mapper/[crypto]
>
>
> I assume I have a bad block device. Does that seem correct? The
> important data is backed up.

If it were persistently, blatantly bad, then the drive firmware would
know about it, and would report a read error. If you're not seeing
libata UNC errors -- or their other manifestation, hard SATA link
resets due to the inappropriate SCSI command timer default in the
kernel -- then it's probably some kind of silent data corruption (SDC),
a torn or misdirected write, etc. If metadata is profile DUP, then
scrub should fix it. If it's not, there's something else going on (or
really bad luck).

I'd like to believe that btrfs check can, or someday will be able to,
do some kind of sanity check on a node that fails checksum, and fix it.
The fact that a node can be read but merely fails checksum isn't really
a good reason for a file system to refuse access to its data, but yeah,
it kinda depends on what's in the node. It could contain up to a couple
hundred items, each of which points elsewhere.

btrfs-debug-tree -b <block number reported by error at logical> <dev>
might give some hint of what's going on. I'd like to believe it'll be
noisy and warn that the checksum fails, but still show the contents,
assuming the drive hands over the data on those sectors.


> If I can copy this entire root filesystem, what is the best way to do
> it? The btrfs restore tool? cp? rsync? Some cloning tool? Other
> options?

0. Backup, that's done.
1. Report 'btrfs check' without --repair, let's see what it complains
about and if it might be able to plausibly fix this.

Since you can scrub, it means the file system mounts. Since the file
system mounts, I would not look at restore to start out because it's
tedious. I'd say you toss a coin over using btrfs send/receive, or
btrfs check --repair to see if it fixes the node. These days it should
be safe with relatively recent btrfs-progs so I'd say use a 4.6.x or
4.7 progs for this. And then the send/receive should be done with -v
or maybe even -vv for both send and receive, along with --max-errors
0, which will permit unlimited errors but will report them rather than
failing midstream. This will get you the bulk of the OS.
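
As a concrete sketch (the subvolume name and mount point here are 
assumptions, adjust to your layout):

  btrfs subvolume snapshot -r / /root.migrate        # send needs a read-only snapshot
  btrfs send -v /root.migrate | btrfs receive -v --max-errors 0 /mnt/newfs/

--max-errors is a receive option; 0 means don't give up after the first 
error, just log them all.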

If you're lucky, the node contains only a handful of relatively
unimportant items, especially if they're files small enough to be
stored inline in the node, which will substantially reduce the number
of errors resulting from the loss of a single node.

The calculus on doing btrfs check --repair first and then send/receive,
versus doing send/receive first and falling back to btrfs check --repair
if that fails, is mainly time. Maybe repair can fix it, maybe it makes
things worse; whereas send/receive might fail midstream without the node
being fixed first, but it causes no additional problems. The second
option is more conservative but takes more time if the send/receive
fails: you then do the repair and have to start the send/receive over
from scratch. (If it fails, you should delete or rename the bad
subvolume on the receive side before starting another send.)


> If I use the btrfs restore tool, should I use options x, m and S? In
> particular I wonder exactly what the S option does. If I leave S out,
> are all symlinks ignored?

I would only use restore for the files that are reported by
send/receive as failed due to errors - assuming that even happens. Or
since this is OS stuff, just reinstall the packages for the files
affected by the bad node.
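
If you do end up with a short list of damaged paths, mapping them back 
to packages is usually quicker than restore. Something like this, with 
the exact commands depending on the distro (paths are illustrative):

  pacman -Qo /path/to/broken/file    # Arch: which package owns it
  dpkg -S /path/to/broken/file       # Debian/Ubuntu equivalent

then reinstall whatever package(s) get named.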


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: checksum error in metadata node - best way to move root fs to new drive?
  2016-08-10  6:27 ` Duncan
  2016-08-10 19:46   ` Austin S. Hemmelgarn
@ 2016-08-10 21:21   ` Chris Murphy
  2016-08-10 22:01     ` Dave T
  2016-08-12 17:00     ` Patrik Lundquist
  1 sibling, 2 replies; 28+ messages in thread
From: Chris Murphy @ 2016-08-10 21:21 UTC (permalink / raw)
  Cc: Btrfs BTRFS

I'm using LUKS, aes xts-plain64, on six devices. One is using mixed-bg
single device. One is dsingle mdup. And then 2x2 mraid1 draid1. I've
had zero problems. The two computers these run on do have aesni
support. Aging wise, they're all at least a  year old. But I've been
using Btrfs on LUKS for much longer than that.


Chris Murphy

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: checksum error in metadata node - best way to move root fs to new drive?
  2016-08-10 21:21   ` Chris Murphy
@ 2016-08-10 22:01     ` Dave T
  2016-08-10 22:23       ` Chris Murphy
  2016-08-11  4:50       ` Duncan
  2016-08-12 17:00     ` Patrik Lundquist
  1 sibling, 2 replies; 28+ messages in thread
From: Dave T @ 2016-08-10 22:01 UTC (permalink / raw)
  To: Chris Murphy, Duncan, ahferroin7; +Cc: Btrfs BTRFS

Thanks for all the responses, guys! I really appreciate it. This
information is very helpful. I will be working through the suggestions
(e.g., check without repair) for the next hour or so. I'll report back
when I have something to report.

My drive is a Samsung 950 Pro nvme drive, which in most respects is
treated like an SSD. (the only difference I am aware of is that trim
isn't needed).

> But until recently dup mode data on single device was impossible, so I
> doubt you were using that, and while dup mode metadata was the normal
> default, on ssd that changes to single mode as well.

Your assumptions are correct: single mode for data and metadata.

Does anyone have any thoughts about using dup mode for metadata on a
Samsung 950 Pro (or any NVMe drive)?

I will be very disappointed if I cannot use btrfs + dm-crypt. As far
as I can see, there is no alternative given that I need to use
snapshots (and LVM, as good as it is, has severe performance penalties
for its snapshots). I'm required to use crypto. I cannot risk doing
without snapshots. Therefore, btrfs + dm-crypt seem like my only
viable solution. Plus it is my preferred solution. I like both tools.

If all goes well, we are planning to implement a production file
server for our office with dm-crypt + btrfs (and a lot of spinning
disks).

In the office we currently have another system identical to mine
running the same drive with dm-crypt + btrfs, the same operating
system, the same nvidia GPU and proprietary driver, and it is running
fine. One difference is that it is overclocked substantially (mine
isn't). I would have expected it would give a problem before mine
would. But it seems to be rock solid. I just ran btrfs scrub on it and
it finished in a few seconds with no errors.

On my computer I have run two extensive memory tests (8 cpu cores in
parallel, all tests). The current test has been running for 14 hrs
with no errors. (I think that 8 cores in parallel make this equivalent
to a much longer test with the default single cpu settings.)
Therefore, I do not believe this issue is caused by RAM.

I'm hoping there is no configuration error or other mistake I made in
setting these systems up that would lead to the problems I'm
experiencing.

BTW, I was able to copy all the files to another drive with no
problem. I used "cp -a" to copy, then I ran "rsync -a" twice to make
sure nothing was missed. My guess is that I'll be able to copy this
right back onto the root filesystem after I resolve whatever the
problem is and my operating system will be back to the same state it
was in prior to this problem.

OK, I'm off to try btrfs check without --repair... thanks again!

For reference:

btrfs-progs v4.6.1
Linux 4.6.4-1-ARCH #1 SMP PREEMPT Mon Jul 11 19:12:32 CEST 2016 x86_64 GNU/Linux



On Wed, Aug 10, 2016 at 5:21 PM, Chris Murphy <lists@colorremedies.com> wrote:
> I'm using LUKS, aes xts-plain64, on six devices. One is using mixed-bg
> single device. One is dsingle mdup. And then 2x2 mraid1 draid1. I've
> had zero problems. The two computers these run on do have aesni
> support. Aging wise, they're all at least a  year old. But I've been
> using Btrfs on LUKS for much longer than that.
>
>
> Chris Murphy

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: checksum error in metadata node - best way to move root fs to new drive?
  2016-08-10 22:01     ` Dave T
@ 2016-08-10 22:23       ` Chris Murphy
  2016-08-10 22:52         ` Dave T
  2016-08-11  7:18         ` Andrei Borzenkov
  2016-08-11  4:50       ` Duncan
  1 sibling, 2 replies; 28+ messages in thread
From: Chris Murphy @ 2016-08-10 22:23 UTC (permalink / raw)
  To: Dave T; +Cc: Chris Murphy, Duncan, Austin Hemmelgarn, Btrfs BTRFS

On Wed, Aug 10, 2016 at 4:01 PM, Dave T <davestechshop@gmail.com> wrote:

> I will be very disappointed if I cannot use btrfs + dm-crypt. As far
> as I can see, there is no alternative given that I need to use
> snapshots (and LVM, as good as it is, has severe performance penalties
> for its snapshots).

See LVM thin provisioning snapshots. I haven't benchmarked it, but
it's a night and day difference from conventional (thick) snapshots.
The gotchas are currently there's no raid support, and the snapshots
are whole volume. So each snapshot appears as a volume with the same
UUID as the original, and by default they're not active. So for me
it's a bit of a head scratcher what happens when mounting a snapshot
concurrent with another. For Btrfs this ends badly. For XFS it refuses
unless using nouuid, but still seems capable of writing to the two
volumes without causing problems.
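
For anyone who wants to try it, the setup is roughly as follows (volume 
group, LV names and sizes are made up):

  lvcreate --type thin-pool -L 100G -n pool0 vg0
  lvcreate --type thin -V 200G --thinpool pool0 -n root vg0
  lvcreate -s -n root_snap vg0/root        # thin snapshot, no size argument needed
  lvchange -ay -K vg0/root_snap            # -K: thin snapshots skip activation by default

That last -K step is the "not active by default" behavior I mentioned.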

But yes, I like Btrfs snapshots and reflinks better. *shrug*

If you find a Btrfs on dmcrypt problem, it's a serious bug, and I
think it would get attention very quickly.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: checksum error in metadata node - best way to move root fs to new drive?
  2016-08-10 21:15 ` Chris Murphy
@ 2016-08-10 22:50   ` Dave T
  0 siblings, 0 replies; 28+ messages in thread
From: Dave T @ 2016-08-10 22:50 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

see below

On Wed, Aug 10, 2016 at 5:15 PM, Chris Murphy <lists@colorremedies.com> wrote:

> 1. Report 'btrfs check' without --repair, let's see what it complains
> about and if it might be able to plausibly fix this.

First, a small part of the dmesg output:

[  172.772283] Btrfs loaded
[  172.772632] BTRFS: device label top_level devid 1 transid 103495 /dev/dm-0
[  274.320762] BTRFS info (device dm-0): use lzo compression
[  274.320764] BTRFS info (device dm-0): disk space caching is enabled
[  274.320764] BTRFS: has skinny extents
[  274.322555] BTRFS info (device dm-0): bdev /dev/mapper/sysluks
errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
[  274.329965] BTRFS: detected SSD devices, enabling SSD mode


Now, full output of btrfs check without repair option.


checking extents
bad metadata [292414541824, 292414558208) crossing stripe boundary
bad metadata [292414607360, 292414623744) crossing stripe boundary
bad metadata [292414672896, 292414689280) crossing stripe boundary
bad metadata [292414738432, 292414754816) crossing stripe boundary
bad metadata [292415787008, 292415803392) crossing stripe boundary
bad metadata [292415918080, 292415934464) crossing stripe boundary
bad metadata [292416376832, 292416393216) crossing stripe boundary
bad metadata [292418015232, 292418031616) crossing stripe boundary
bad metadata [292419325952, 292419342336) crossing stripe boundary
bad metadata [292419588096, 292419604480) crossing stripe boundary
bad metadata [292419915776, 292419932160) crossing stripe boundary
bad metadata [292422930432, 292422946816) crossing stripe boundary
bad metadata [292423061504, 292423077888) crossing stripe boundary
ref mismatch on [292423155712 16384] extent item 1, found 0
Backref 292423155712 root 258 not referenced back 0x2280a20
Incorrect global backref count on 292423155712 found 1 wanted 0
backpointer mismatch on [292423155712 16384]
owner ref check failed [292423155712 16384]
bad metadata [292423192576, 292423208960) crossing stripe boundary
bad metadata [292423323648, 292423340032) crossing stripe boundary
bad metadata [292429549568, 292429565952) crossing stripe boundary
bad metadata [292439904256, 292439920640) crossing stripe boundary
bad metadata [292440297472, 292440313856) crossing stripe boundary
bad metadata [292442525696, 292442542080) crossing stripe boundary
bad metadata [292443770880, 292443787264) crossing stripe boundary
bad metadata [292443967488, 292443983872) crossing stripe boundary
bad metadata [292444033024, 292444049408) crossing stripe boundary
bad metadata [292444098560, 292444114944) crossing stripe boundary
bad metadata [292444164096, 292444180480) crossing stripe boundary
bad metadata [292444229632, 292444246016) crossing stripe boundary
bad metadata [292444688384, 292444704768) crossing stripe boundary
bad metadata [292444884992, 292444901376) crossing stripe boundary
bad metadata [292445081600, 292445097984) crossing stripe boundary
bad metadata [292446720000, 292446736384) crossing stripe boundary
bad metadata [292448948224, 292448964608) crossing stripe boundary
Error: could not find btree root extent for root 258
Checking filesystem on /dev/mapper/cryptroot
UUID:

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: checksum error in metadata node - best way to move root fs to new drive?
  2016-08-10 22:23       ` Chris Murphy
@ 2016-08-10 22:52         ` Dave T
  2016-08-11 14:12           ` Nicholas D Steeves
  2016-08-11  7:18         ` Andrei Borzenkov
  1 sibling, 1 reply; 28+ messages in thread
From: Dave T @ 2016-08-10 22:52 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Duncan, Austin Hemmelgarn, Btrfs BTRFS

Apologies. I have to make a correction to the message I just sent.
Disregard that message and use this one:


On Wed, Aug 10, 2016 at 5:15 PM, Chris Murphy <lists@colorremedies.com> wrote:

> 1. Report 'btrfs check' without --repair, let's see what it complains
> about and if it might be able to plausibly fix this.

First, a small part of the dmesg output:

[  172.772283] Btrfs loaded
[  172.772632] BTRFS: device label top_level devid 1 transid 103495 /dev/dm-0
[  274.320762] BTRFS info (device dm-0): use lzo compression
[  274.320764] BTRFS info (device dm-0): disk space caching is enabled
[  274.320764] BTRFS: has skinny extents
[  274.322555] BTRFS info (device dm-0): bdev /dev/mapper/cryptroot
errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
[  274.329965] BTRFS: detected SSD devices, enabling SSD mode


Now, full output of btrfs check without repair option.


checking extents
bad metadata [292414541824, 292414558208) crossing stripe boundary
bad metadata [292414607360, 292414623744) crossing stripe boundary
bad metadata [292414672896, 292414689280) crossing stripe boundary
bad metadata [292414738432, 292414754816) crossing stripe boundary
bad metadata [292415787008, 292415803392) crossing stripe boundary
bad metadata [292415918080, 292415934464) crossing stripe boundary
bad metadata [292416376832, 292416393216) crossing stripe boundary
bad metadata [292418015232, 292418031616) crossing stripe boundary
bad metadata [292419325952, 292419342336) crossing stripe boundary
bad metadata [292419588096, 292419604480) crossing stripe boundary
bad metadata [292419915776, 292419932160) crossing stripe boundary
bad metadata [292422930432, 292422946816) crossing stripe boundary
bad metadata [292423061504, 292423077888) crossing stripe boundary
ref mismatch on [292423155712 16384] extent item 1, found 0
Backref 292423155712 root 258 not referenced back 0x2280a20
Incorrect global backref count on 292423155712 found 1 wanted 0
backpointer mismatch on [292423155712 16384]
owner ref check failed [292423155712 16384]
bad metadata [292423192576, 292423208960) crossing stripe boundary
bad metadata [292423323648, 292423340032) crossing stripe boundary
bad metadata [292429549568, 292429565952) crossing stripe boundary
bad metadata [292439904256, 292439920640) crossing stripe boundary
bad metadata [292440297472, 292440313856) crossing stripe boundary
bad metadata [292442525696, 292442542080) crossing stripe boundary
bad metadata [292443770880, 292443787264) crossing stripe boundary
bad metadata [292443967488, 292443983872) crossing stripe boundary
bad metadata [292444033024, 292444049408) crossing stripe boundary
bad metadata [292444098560, 292444114944) crossing stripe boundary
bad metadata [292444164096, 292444180480) crossing stripe boundary
bad metadata [292444229632, 292444246016) crossing stripe boundary
bad metadata [292444688384, 292444704768) crossing stripe boundary
bad metadata [292444884992, 292444901376) crossing stripe boundary
bad metadata [292445081600, 292445097984) crossing stripe boundary
bad metadata [292446720000, 292446736384) crossing stripe boundary
bad metadata [292448948224, 292448964608) crossing stripe boundary
Error: could not find btree root extent for root 258
Checking filesystem on /dev/mapper/cryptroot

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: checksum error in metadata node - best way to move root fs to new drive?
  2016-08-10 22:01     ` Dave T
  2016-08-10 22:23       ` Chris Murphy
@ 2016-08-11  4:50       ` Duncan
  2016-08-11  5:06         ` Gareth Pye
  1 sibling, 1 reply; 28+ messages in thread
From: Duncan @ 2016-08-11  4:50 UTC (permalink / raw)
  To: linux-btrfs

Dave T posted on Wed, 10 Aug 2016 18:01:44 -0400 as excerpted:

> Does anyone have any thoughts about using dup mode for metadata on a
> Samsung 950 Pro (or any NVMe drive)?

The biggest problem with dup on ssds is that some ssds (particularly the 
ones with the sandforce controllers) do dedup, so you'd be having btrfs 
do dup while the drive's firmware dedups it right back out, to no effect 
except more cpu and device processing!

(The other argument for single on ssd that I've seen is that because the 
FTL ultimately places the data, and because both copies are written at 
the same time, there's a good chance that the FTL will write them into 
the same erase block and area, and a defect in one will likely be a 
defect in the other as well.  That may or may not be, I'm not qualified 
to say, but as explained below, I do choose to take my chances on that 
and thus do run dup on ssd.)

So as long as the SSD doesn't have a deduping FTL, I'd suggest dup for 
metadata on ssd does make sense.  Data... not so sure on, but certainly 
metadata, because one bad block of metadata can be many messed up files.
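
(On an existing single-metadata filesystem that can be done on the fly 
with a balance, or chosen up front at mkfs time -- the device name below 
is just a placeholder:

  btrfs balance start -mconvert=dup /mountpoint
  mkfs.btrfs -m dup -d single /dev/mapper/newcrypt

The balance only rewrites the metadata chunks, so it's normally quick.)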

On my ssds here, which I know don't do dedup, most of my btrfs are raid1 
on the pair of ssds.  However, /boot is different since I can't really 
point grub at two different /boots, so I have my working /boot on one 
device, with the backup /boot on the other, and the grub on each one 
pointed at its respective /boot, so I can select working or backup /boot 
from the BIOS and it'll just work.  Since /boot is so small, it's mixed-
mode chunks, meaning data and metadata are mixed together and the 
redundancy mode applies to both at once instead of each separately.  And 
I chose dup, so it's dup for both data and metadata.

Works fine, dup for both data and metadata on non-deduping ssds, but of 
course that means data takes double the space since there's two copies of 
it, and that gets kind of expensive on ssd, if it's more than the 
fraction of a GiB that's /boot.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: checksum error in metadata node - best way to move root fs to new drive?
  2016-08-11  4:50       ` Duncan
@ 2016-08-11  5:06         ` Gareth Pye
  2016-08-11  8:20           ` Duncan
  0 siblings, 1 reply; 28+ messages in thread
From: Gareth Pye @ 2016-08-11  5:06 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs

Is there some simple muddling of metadata that could be done to force
dup metadata on deduping SSDs? Like a simple 'random' byte repeated
often enough that it would defeat any sane dedup? I know it would waste
data but clearly that is considered worth it with dup metadata (what
is the difference between 50% metadata efficiency and 45%?)

On Thu, Aug 11, 2016 at 2:50 PM, Duncan <1i5t5.duncan@cox.net> wrote:
> Dave T posted on Wed, 10 Aug 2016 18:01:44 -0400 as excerpted:
>
>> Does anyone have any thoughts about using dup mode for metadata on a
>> Samsung 950 Pro (or any NVMe drive)?
>
> The biggest problem with dup on ssds is that some ssds (particularly the
> ones with the sandforce controllers) do dedup, so you'd be having btrfs
> do dup while the filesystem dedups, to no effect except more cpu and
> device processing!
>
> (The other argument for single on ssd that I've seen is that because the
> FTL ultimately places the data, and because both copies are written at
> the same time, there's a good chance that the FTL will write them into
> the same erase block and area, and a defect in one will likely be a
> defect in the other as well.  That may or may not be, I'm not qualified
> to say, but as explained below, I do choose to take my chances on that
> and thus do run dup on ssd.)
>
> So as long as the SSD doesn't have a deduping FTL, I'd suggest dup for
> metadata on ssd does make sense.  Data... not so sure on, but certainly
> metadata, because one bad block of metadata can be many messed up files.
>
> On my ssds here, which I know don't do dedup, most of my btrfs are raid1
> on the pair of ssds.  However, /boot is different since I can't really
> point grub at two different /boots, so I have my working /boot on one
> device, with the backup /boot on the other, and the grub on each one
> pointed at its respective /boot, so I can select working or backup /boot
> from the BIOS and it'll just work.  Since /boot is so small, it's mixed-
> mode chunks, meaning data and metadata are mixed together and the
> redundancy mode applies to both at once instead of each separately.  And
> I chose dup, so it's dup for both data and metadata.
>
> Works fine, dup for both data and metadata on non-deduping ssds, but of
> course that means data takes double the space since there's two copies of
> it, and that gets kind of expensive on ssd, if it's more than the
> fraction of a GiB that's /boot.
>
> --
> Duncan - List replies preferred.   No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master."  Richard Stallman
>



-- 
Gareth Pye - blog.cerberos.id.au
Level 2 MTG Judge, Melbourne, Australia

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: checksum error in metadata node - best way to move root fs to new drive?
  2016-08-10 22:23       ` Chris Murphy
  2016-08-10 22:52         ` Dave T
@ 2016-08-11  7:18         ` Andrei Borzenkov
  1 sibling, 0 replies; 28+ messages in thread
From: Andrei Borzenkov @ 2016-08-11  7:18 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Dave T, Duncan, Austin Hemmelgarn, Btrfs BTRFS

On Thu, Aug 11, 2016 at 1:23 AM, Chris Murphy <lists@colorremedies.com> wrote:
> On Wed, Aug 10, 2016 at 4:01 PM, Dave T <davestechshop@gmail.com> wrote:
>
>> I will be very disappointed if I cannot use btrfs + dm-crypt. As far
>> as I can see, there is no alternative given that I need to use
>> snapshots (and LVM, as good as it is, has severe performance penalties
>> for its snapshots).
>
> See LVM thin provisioning snapshots. I haven't benchmarked it, but
> it's a night and day difference from conventional (thick) snapshots.
> The gotchas are currently there's no raid support, and the snapshots
> are whole volume. So each snapshot appears as a volume with the same
> UUID as the original, and by default they're not active. So for me
> it's a bit of a head scratcher what happens when mounting a snapshot
> concurrent with another. For Btrfs this ends badly. For XFS it refuses
> unless using nouuid, but still seems capable of writing to the two
> volumes without causing problems.
>

XFS now allows changing UUID, as do LVM and MD. We can also change
btrfs UUID using "btrfstune -u", but I wonder if there is any way to
change device UUID in this case.
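
For the filesystem UUID that part is straightforward -- device paths 
below are examples, and both tools want the filesystem unmounted or the 
volume inactive:

  btrfstune -u /dev/vg0/root_snap           # btrfs: write a new random fsid
  xfs_admin -U generate /dev/vg0/root_snap  # xfs: same idea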

One problem is that even before you get around to doing it, various udev
rules kick in and create links to the wrong instance, overwriting the
previous ones; and I'm not sure either xfs_admin or btrfstune triggers a
change event. So we may end up with stale, completely wrong links.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: checksum error in metadata node - best way to move root fs to new drive?
  2016-08-11  5:06         ` Gareth Pye
@ 2016-08-11  8:20           ` Duncan
  0 siblings, 0 replies; 28+ messages in thread
From: Duncan @ 2016-08-11  8:20 UTC (permalink / raw)
  To: linux-btrfs

Gareth Pye posted on Thu, 11 Aug 2016 15:06:48 +1000 as excerpted:

> Is there some simple muddling of meta data that could be done to force
> dup meta data on deduping SSDs? Like a simple 'random' byte repeated
> often enough it would defeat any sane dedup? I know it would waste data
> but clearly that is considered worth it with dup metadata (what is the
> difference between 50% metadata efficiency and 45%?)

Well, the FTLs are mostly proprietary, AFAIK, so it's probably hard to 
prove the "force", but given the 512-byte sector standard (some are a 
multiple of that these days but 512 should be the minimum), in theory one 
random byte out of every 512 should do it... unless the compression these 
deduping FTLs generally run as well catches that difference and 
compresses it out to a different location where it can be compactly 
stored, allowing multiple copies of the same 512-byte sector to be stored 
in a single sector, so long as they only had a single byte or two 
different.

So it could probably be done, but given that the deduping and compression 
features of these ssds are listed as just that, features, and that people 
buy them for that, it may be that it's better to simply leave well enough 
alone.  Folks who want dup metadata can set it, and if they haven't 
bought one of these ssds with dedup as a feature, they can be reasonably 
sure it'll be set.  And people who don't care will simply get the 
defaults and can live with them the same way that people that don't care 
generally live with defaults that may or may not be the absolute best 
case for them, but are generally at least not horrible.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: checksum error in metadata node - best way to move root fs to new drive?
  2016-08-10 22:52         ` Dave T
@ 2016-08-11 14:12           ` Nicholas D Steeves
  2016-08-11 14:45             ` Austin S. Hemmelgarn
                               ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Nicholas D Steeves @ 2016-08-11 14:12 UTC (permalink / raw)
  To: Dave T; +Cc: Chris Murphy, Duncan, Austin Hemmelgarn, Btrfs BTRFS

Why is the combination of dm-crypt|luks+btrfs+compress=lzo so
overlooked as a potential cause?  Other than the "raid56 ate my data"
threads, I've noticed a bunch of "luks+btrfs+compress=lzo ate my data"
threads.

On 10 August 2016 at 15:46, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote:
>
> As far as dm-crypt goes, it looks like BTRFS is stable on top in the
> configuration I use (aes-xts-plain64 with a long key using plain dm-crypt
> instead of LUKS).  I have heard rumors of issues when using LUKS without
> hardware acceleration, but I've never seen any conclusive proof, and what
> little I've heard sounds more like it was just race conditions elsewhere
> causing the issues.
>

Austin, I'm very curious if they were also using compress=lzo, because
my informal hypothesis is that the encryption+btrfs+compress=lzo
combination precipitates these issues.  Maybe the combo is more likely
to trigger these race conditions?  It might also be neat to mine the
archive to see if these seem to be more likely to occur with fast SSDs vs
slow rotational disks.  Do you use compress=lzo?

On 10 August 2016 at 18:52, Dave T <davestechshop@gmail.com> wrote:
> On Wed, Aug 10, 2016 at 5:15 PM, Chris Murphy <lists@colorremedies.com> wrote:
>
>> 1. Report 'btrfs check' without --repair, let's see what it complains
>> about and if it might be able to plausibly fix this.
>
> First, a small part of the dmesg output:
>
> [  172.772283] Btrfs loaded
> [  172.772632] BTRFS: device label top_level devid 1 transid 103495 /dev/dm-0
> [  274.320762] BTRFS info (device dm-0): use lzo compression

Compress=lzo confirmed.  Corruption occurred on an SSD.

On 10 August 2016 at 17:21, Chris Murphy <lists@colorremedies.com> wrote:
> I'm using LUKS, aes xts-plain64, on six devices. One is using mixed-bg
> single device. One is dsingle mdup. And then 2x2 mraid1 draid1. I've
> had zero problems. The two computers these run on do have aesni
> support. Aging wise, they're all at least a  year old. But I've been
> using Btrfs on LUKS for much longer than that.
>

Chris, do you use compress=lzo?  SSDs or rotational disks?

If a bunch of people are using this combo without issue, I'll drop the
informal hypothesis as "just a suspicion informed by sloppy pattern
recognition" ;-)

Thank you!
Nicholas

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: checksum error in metadata node - best way to move root fs to new drive?
  2016-08-11 14:12           ` Nicholas D Steeves
@ 2016-08-11 14:45             ` Austin S. Hemmelgarn
  2016-08-11 19:07             ` Duncan
  2016-08-11 20:33             ` Chris Murphy
  2 siblings, 0 replies; 28+ messages in thread
From: Austin S. Hemmelgarn @ 2016-08-11 14:45 UTC (permalink / raw)
  To: Nicholas D Steeves, Dave T; +Cc: Chris Murphy, Duncan, Btrfs BTRFS

On 2016-08-11 10:12, Nicholas D Steeves wrote:
> Why is the combination of dm-crypt|luks+btrfs+compress=lzo so
> overlooked as a potential cause?  Other than the "raid56 ate my data"
> I've noticed a bunch of "luks+btrfs+compress=lzo ate my data" threads.
I haven't personally seen one of those in at least a few months.  In 
general, BTRFS is moving fast enough that reports older than a kernel 
release cycle are generally out of date unless something confirms 
otherwise, but I do distinctly recall such issues being commonly 
reported in the past.
>
> On 10 August 2016 at 15:46, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote:
>>
>> As far as dm-crypt goes, it looks like BTRFS is stable on top in the
>> configuration I use (aes-xts-plain64 with a long key using plain dm-crypt
>> instead of LUKS).  I have heard rumors of issues when using LUKS without
>> hardware acceleration, but I've never seen any conclusive proof, and what
>> little I've heard sounds more like it was just race conditions elsewhere
>> causing the issues.
>>
>
> Austin, I'm very curious if they were also using compress=lzo, because
> my informal hypothesis is that the encryption+btrfs+compress=lzo
> combination precipitates these issues.  Maybe the combo is more likely
> to trigger these race conditions?  It might also be neat to mine the
> archive to see if these seem to be more likely to occur with fast SSDs vs
> slow rotational disks.  Do you use compress=lzo?
In my case, I've tested on both SSD's (both cheap low-end ones and good 
Intel and Crucial ones) and traditional hard drives, with and without 
compression (both zlib and lzo), and with a couple of different 
encryption algorithms (AES, Blowfish, and Threefish).  In my case it's 
only on plain dm-crypt, not LUKS, but I doubt that particular point will 
make much difference.  The last test I did was when the merge window for 
4.6 closed, run as part of the regular regression testing I do, and I'll 
be doing another one in the near future.  I think the last time I saw 
any issues with this in my testing was prior to 4.0, but I don't 
remember for sure (most of what I care about is comparison to the 
previous version, so I don't keep much in the way of records of specific 
things).



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: checksum error in metadata node - best way to move root fs to new drive?
  2016-08-11 14:12           ` Nicholas D Steeves
  2016-08-11 14:45             ` Austin S. Hemmelgarn
@ 2016-08-11 19:07             ` Duncan
  2016-08-11 20:43               ` Chris Murphy
  2016-08-11 20:33             ` Chris Murphy
  2 siblings, 1 reply; 28+ messages in thread
From: Duncan @ 2016-08-11 19:07 UTC (permalink / raw)
  To: linux-btrfs

Nicholas D Steeves posted on Thu, 11 Aug 2016 10:12:04 -0400 as excerpted:

> Why is the combination of dm-crypt|luks+btrfs+compress=lzo so overlooked
> as a potential cause?  Other than the "raid56 ate my data" I've noticed
> a bunch of "luks+btrfs+compress=lzo ate my data" threads.

My usage is btrfs on physical device (well, on GPT partitions on the 
physical device), no encryption, and it's mostly raid1 on paired devices, 
but there's definitely one kink that compress=lzo (and I believe 
compression in general, including gzip) adds, and it's possible running 
it on encryption compounds the issue.

The compression-related problem is this:  Btrfs is considerably less 
tolerant of checksum-related errors on btrfs-compressed data, and while 
on uncompressed btrfs raid1 it will recover from the second copy where 
possible and continue, on files that btrfs has compressed, if there are 
enough checksum errors, for example in a hard-shutdown situation where 
one of the raid1 devices had the updates written but it crashed while 
writing the other, btrfs will crash instead of simply falling back to the 
good copy.

This is known to be specific to compression; uncompressed btrfs recovers 
as intended from the second copy.  And it's known to occur only when 
there's too many checksum errors in a burst -- the filesystem apparently 
deals correctly with just a few at a time.

This problem has been ongoing for years -- I thought it was just the way 
btrfs worked until someone mentioned that it didn't behave that way 
without compression -- and it reasonably regularly prevents a smooth 
reboot here after a crash.

In my case I have the system btrfs running read-only by default, so it's 
not damaged.  However, /home and /var/log are of course mounted writable, 
and that's where the problems come in.  If I start in (I believe) rescue 
mode (it's that or emergency, the other won't do the mounts and won't let 
me do them manually either, as it thinks a dependency is missing), 
systemd will do the mounts but not start the (permanent) logging or the 
services that need to routinely write stuff that I have symlinked into 
/home/var/whatever so they can write with a read-only root and system 
partition, I can then scrub the mounted home and log partitions to fix 
the checksum errors due to one device having the update while the other 
doesn't, and continue booting normally.  However, if I try directly 
booting normally, the system invariably crashes due to too many checksum 
errors, even when it /should/ simply read the other copy, which is fine 
as demonstrated by the fact that scrub can use it to fix the errors on 
the device triggering the checksum errors.
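
(The manual fixup from the rescue shell is just an ordinary scrub of the 
writable mounts -- the mountpoints here are my own layout:

  btrfs scrub start -Bd /home
  btrfs scrub start -Bd /var/log

-B keeps it in the foreground and -d prints per-device stats; once both 
come back clean, the normal boot proceeds without the crash.)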

This continued to happen with 4.6.  I'm on 4.7 now but am not sure I've 
crashed with it and thus can't say for sure whether the problem is fixed 
there.  However, I doubt it, as the problem has been there apparently 
since the compression and raid1 features were introduced, and I didn't 
see anything mentioning a fix for the issue in the patches going by on 
the list.

The problem is most obvious and reproducible in btrfs raid1 mode, since 
there, one device /can/ be behind the other, and scrub /can/ be 
demonstrated to fix it so it's obviously a checksum issue, but I'd 
imagine if enough checksum mismatches happen on a single device in single 
mode, it would crash as well, and of course then there's no second copy 
for scrub to fix the bad copy from, so it would simply show up as a btrfs 
that can mount but with significant corruption issues that will crash the 
system if an attempt to read the affected blocks reads too many at a time.

And to whatever possible extent an encryption layer between the physical 
device and btrfs results in possible additional corruption in the event 
of a crash or hard shutdown, it could easily compound an already bad 
situation.

Meanwhile, /if/ that does turn out to be the root issue here, then 
finally fixing the btrfs compression related problem where a large burst 
of checksum failures crashes the system, even when there provably exists 
a second valid copy, but where this only happens with compression, should 
go quite far in stabilizing btrfs on encrypted underlayers.

I know I certainly wouldn't object to the problem being fixed. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: checksum error in metadata node - best way to move root fs to new drive?
  2016-08-11 14:12           ` Nicholas D Steeves
  2016-08-11 14:45             ` Austin S. Hemmelgarn
  2016-08-11 19:07             ` Duncan
@ 2016-08-11 20:33             ` Chris Murphy
  2 siblings, 0 replies; 28+ messages in thread
From: Chris Murphy @ 2016-08-11 20:33 UTC (permalink / raw)
  To: Nicholas D Steeves
  Cc: Dave T, Chris Murphy, Duncan, Austin Hemmelgarn, Btrfs BTRFS

On Thu, Aug 11, 2016 at 8:12 AM, Nicholas D Steeves <nsteeves@gmail.com> wrote:

>
> Chris, do you use compress=lzo?  SSDs or rotational disks?

No compression, SSD and HDD. The stuff I care about has been on dmcrypt
(LUKS) for some time. Stuff I sorta care about is on plain partitions.
Stuff I don't care much about is either on LVM LVs (usually thinp),
or qcow2.

I have used compression for periods measured in months not years, both
zlib and lzo, on both SSD and HDD, to no ill effect. But it's true
some of the more abruptly and badly damaged file systems did use
compress=lzo. Since lzo is faster and compresses only a bit less well
than zlib, it may be that more people choose lzo, and that's why, if
there's a problem with compression, it happens to show up with lzo --
coincidence rather than causation. I'm not even sure there's enough
information to establish correlation.



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: checksum error in metadata node - best way to move root fs to new drive?
  2016-08-11 19:07             ` Duncan
@ 2016-08-11 20:43               ` Chris Murphy
  2016-08-12  3:11                 ` Duncan
  0 siblings, 1 reply; 28+ messages in thread
From: Chris Murphy @ 2016-08-11 20:43 UTC (permalink / raw)
  To: Duncan; +Cc: Btrfs BTRFS

On Thu, Aug 11, 2016 at 1:07 PM, Duncan <1i5t5.duncan@cox.net> wrote:
> The compression-related problem is this:  Btrfs is considerably less
> tolerant of checksum-related errors on btrfs-compressed data,


Why? The data is the data. And why would it matter whether it's
application-compressed data vs Btrfs-compressed data? If there's an
error, Btrfs is intolerant. I don't see how there's a checksum error
that Btrfs tolerates.

But also I don't know whether the checksum is computed on compressed
data or on uncompressed data -- does scrub blindly read the compressed
data, checksum it, and compare to the previously recorded csum? Or does
it read the compressed data, decompress it, checksum it, then compare?
And does compression compress metadata? I don't think it does, judging
from some squashfs testing of the same set of binary files on ext4 vs
btrfs uncompressed vs btrfs compressed. The difference there is
explained by inline data being compressed (which it is), so I don't
think the fs itself gets compressed.


Chris Murphy

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: checksum error in metadata node - best way to move root fs to new drive?
  2016-08-11 20:43               ` Chris Murphy
@ 2016-08-12  3:11                 ` Duncan
  2016-08-12  3:51                   ` Chris Murphy
  0 siblings, 1 reply; 28+ messages in thread
From: Duncan @ 2016-08-12  3:11 UTC (permalink / raw)
  To: linux-btrfs

Chris Murphy posted on Thu, 11 Aug 2016 14:43:56 -0600 as excerpted:

> On Thu, Aug 11, 2016 at 1:07 PM, Duncan <1i5t5.duncan@cox.net> wrote:
>> The compression-related problem is this:  Btrfs is considerably less
>> tolerant of checksum-related errors on btrfs-compressed data,
> 
> Why? The data is the data. And why would it matter if it's application
> compressed data vs Btrfs compressed data? If there's an error, Btrfs is
> intolerant. I don't see how there's a checksum error that Btrfs
> tolerates.

Apparently, the code path for compressed data is sufficiently different 
that when there's a burst of checksum errors, even on raid1 where it 
should (and with scrub does) get the correct second copy, it will crash 
the system.  This is my experience and that of others, and what I thought 
was standard btrfs behavior -- I didn't know it was a compression-
specific bug, since I use compress on all my btrfs, until someone told me.

When the btrfs compression option hasn't been used on that filesystem, or 
presumably when none of that burst of checksum errors is from btrfs-
compressed files, it will grab the second copy and use it as it should, 
and there will be no crash.  This is as reported by others, including 
people who have tested both with and without btrfs-compressed files and 
found that it only crashed if the files were btrfs-compressed, whereas it 
worked as expected, fetching the valid second copy, if they weren't btrfs-
compressed.

I'd assume this is why this particular bug has remained unsquashed for so 
long.  The devs are likely testing compression, and bad checksum data 
repair from the second copy, but they probably aren't testing bad 
checksum repair on compressed data, so the problem isn't showing up in 
their tests.  Between that and relatively few people running raid1 with 
the compression option and seeing enough bad shutdowns to be aware of the 
problem, it has mostly flown under the radar.  For a long time I myself 
thought it was just the way btrfs behaved with bursts of checksum errors, 
until someone pointed out that it did /not/ behave that way on btrfs that 
didn't have any compressed files when the checksum errors occurred.

> But also I don't know if the checksum is predicated on compressed data
> or uncompressed data - does the scrub blindly read compressed data,
> checksums it, and compares to the previously recorded csum? Or does the
> scrub read compressed data, decompresses it, checksums it, then
> compares? And does compression compress metadata? I don't think it does
> from some of the squashfs testing of the same set of binary files on
> ext4 vs btrfs uncompressed vs btrfs compressed. The difference is
> explained by inline data being compressed (which it is), so I don't
> think the fs itself gets compressed.

As I'm not a coder I can't actually tell you from reading the code, but 
AFAIK, both the 128 KiB compression block size and the checksum are on 
the uncompressed data.  Compression takes place after checksumming.

And I don't believe metadata, whether metadata itself or inline data, is 
compressed by btrfs' transparent compression.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: checksum error in metadata node - best way to move root fs to new drive?
  2016-08-12  3:11                 ` Duncan
@ 2016-08-12  3:51                   ` Chris Murphy
  0 siblings, 0 replies; 28+ messages in thread
From: Chris Murphy @ 2016-08-12  3:51 UTC (permalink / raw)
  To: Btrfs BTRFS

On Thu, Aug 11, 2016 at 9:11 PM, Duncan <1i5t5.duncan@cox.net> wrote:
> Chris Murphy posted on Thu, 11 Aug 2016 14:43:56 -0600 as excerpted:
>
>> On Thu, Aug 11, 2016 at 1:07 PM, Duncan <1i5t5.duncan@cox.net> wrote:
>>> The compression-related problem is this:  Btrfs is considerably less
>>> tolerant of checksum-related errors on btrfs-compressed data,
>>
>> Why? The data is the data. And why would it matter if it's application
>> compressed data vs Btrfs compressed data? If there's an error, Btrfs is
>> intolerant. I don't see how there's a checksum error that Btrfs
>> tolerates.
>
> Apparently, the code path for compressed data is sufficiently different,
> that when there's a burst of checksum errors, even on raid1 where it
> should (and does with scrub) get the correct second copy, it will crash
> the system.

Ahh OK, gotcha.

>  This is my experience and that of others, and what I thought
> was standard btrfs behavior -- I didn't know it was a compression-
> specific bug since I use compress on all my btrfs, until someone told me.
>
> When the btrfs compression option hasn't been used on that filesystem, or
> presumably when none of that burst of checksum errors is from btrfs-
> compressed files, it will grab the second copy and use it as it should,
> and there will be no crash.  This is as reported by others, including
> people who have tested both with and without btrfs-compressed files and
> found that it only crashed if the files were btrfs-compressed, whereas it
> worked as expected, fetching the valid second copy, if they weren't btrfs-
> compressed.

OK so something's broken.


>
> As I'm not a coder I can't actually tell you from reading the code, but
> AFAIK, both the 128 KiB compression block size and the checksum are on
> the uncompressed data.  Compression takes place after checksumming.
>
> And I don't believe metadata, whether metadata itself or inline data, is
> compressed by btrfs' transparent compression.

Inline data is definitely compressed.

From ls -li
263 -rw-r-----. 1 root root  3270 Aug 11 21:29 samsung840-256g-hdparm.txt


From btrfs-debug-tree

    item 84 key (263 INODE_ITEM 0) itemoff 7618 itemsize 160
        inode generation 7 transid 7 size 3270 nbytes 3270
        block group 0 mode 100640 links 1 uid 0 gid 0
        rdev 0 flags 0x0
    item 85 key (263 INODE_REF 256) itemoff 7582 itemsize 36
        inode ref index 8 namelen 26 name: samsung840-256g-hdparm.txt
    item 86 key (263 XATTR_ITEM 3817753667) itemoff 7499 itemsize 83
        location key (0 UNKNOWN.0 0) type XATTR
        namelen 16 datalen 37 name: security.selinux
        data unconfined_u:object_r:unlabeled_t:s0
    item 87 key (263 EXTENT_DATA 0) itemoff 5860 itemsize 1639
        inline extent data size 1618 ram 3270 compress(zlib)


Curiously though, these same small text files once above a certain
size (?) are not compressed if they aren't inline extents.

278 -rw-r-----. 1 root root 11767 Aug 11 21:29 WDCblack-750g-smartctlx_2.txt


    item 48 key (278 INODE_ITEM 0) itemoff 7675 itemsize 160
        inode generation 7 transid 7 size 11767 nbytes 12288
        block group 0 mode 100640 links 1 uid 0 gid 0
        rdev 0 flags 0x0
    item 49 key (278 INODE_REF 256) itemoff 7636 itemsize 39
        inode ref index 23 namelen 29 name: WDCblack-750g-smartctlx_2.txt
    item 50 key (278 XATTR_ITEM 3817753667) itemoff 7553 itemsize 83
        location key (0 UNKNOWN.0 0) type XATTR
        namelen 16 datalen 37 name: security.selinux
        data unconfined_u:object_r:unlabeled_t:s0
    item 51 key (278 EXTENT_DATA 0) itemoff 7500 itemsize 53
        extent data disk byte 12939264 nr 4096
        extent data offset 0 nr 12288 ram 12288
        extent compression(zlib)


Hrrmm.
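
For a quicker spot-check, filefrag from e2fsprogs should work too, if I 
remember the flag names right: btrfs reports compressed extents through 
FIEMAP as 'encoded', and small inline extents as 'inline' (the file name 
here is just the example from above):

    filefrag -v samsung840-256g-hdparm.txt
    # compressed extents should carry the 'encoded' flag,
    # inline data the 'inline' flag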


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: checksum error in metadata node - best way to move root fs to new drive?
  2016-08-10 21:21   ` Chris Murphy
  2016-08-10 22:01     ` Dave T
@ 2016-08-12 17:00     ` Patrik Lundquist
  1 sibling, 0 replies; 28+ messages in thread
From: Patrik Lundquist @ 2016-08-12 17:00 UTC (permalink / raw)
  To: Btrfs BTRFS

On 10 August 2016 at 23:21, Chris Murphy <lists@colorremedies.com> wrote:
>
I'm using LUKS, aes-xts-plain64, on six devices. One is using mixed-bg
single device. One is dsingle mdup. And then 2x2 mraid1 draid1. I've
had zero problems. The two computers these run on do have AES-NI
support. Aging-wise, they're all at least a year old. But I've been
using Btrfs on LUKS for much longer than that.

FWIW:
I've had 5 spinning disks with LUKS + Btrfs raid1 for 1.5 years.
Also xts-plain64 with AES-NI acceleration.
No problems so far. Not using Btrfs compression.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: checksum error in metadata node - best way to move root fs to new drive?
  2016-08-12 15:06   ` Duncan
@ 2016-08-15 11:33     ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 28+ messages in thread
From: Austin S. Hemmelgarn @ 2016-08-15 11:33 UTC (permalink / raw)
  To: linux-btrfs

On 2016-08-12 11:06, Duncan wrote:
> Austin S. Hemmelgarn posted on Fri, 12 Aug 2016 08:04:42 -0400 as
> excerpted:
>
>> On a file server?  No, I'd ensure proper physical security is
>> established and make sure it's properly secured against network based
>> attacks and then not worry about it.  Unless you have things you want to
>> hide from law enforcement or your government (which may or may not be
>> legal where you live) or can reasonably expect someone to steal the
>> system, you almost certainly don't actually need whole disk encryption.
>> There are two specific exceptions to this though:
>> 1. If your employer requires encryption on this system, that's their
>> call.
>> 2. Encrypted swap is a good thing regardless, because it prevents
>> security credentials from accidentally being written unencrypted to
>> persistent storage.
>
> In the US, medical records are pretty well protected under penalty of law
> (HIPAA, IIRC?).  Anyone storing medical records here would do well to
> have full filesystem encryption for that reason.
>
> Of course financial records are sensitive as well, or even just forum
> login information, and then there's the various industrial spies from
> various countries (China being the one most frequently named) that would
> pay good money for unencrypted devices from the right sources.
>
Medical and even financial records really fall under my first exception, 
but encryption is still no substitute for proper physical security.  As 
far as user account information goes, that depends on what your legal or 
PR department promised, but in many cases there's minimal improvement in 
security when using full disk encryption in place of just encrypting the 
database file used to store the information.

In either case though, it's still a better investment in terms of both 
time and money to properly secure the network and physical access to the 
hardware.  All that disk encryption protects is data at rest, and for a 
_server_ system, the data is almost always online, and therefore lack of 
protection of the system as a whole is usually more of a security issue 
in general than lack of protection for a single disk that's powered off.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: checksum error in metadata node - best way to move root fs to new drive?
  2016-08-12 12:04 ` Austin S. Hemmelgarn
  2016-08-12 15:06   ` Duncan
@ 2016-08-12 17:02   ` Chris Murphy
  1 sibling, 0 replies; 28+ messages in thread
From: Chris Murphy @ 2016-08-12 17:02 UTC (permalink / raw)
  To: Btrfs BTRFS

On Fri, Aug 12, 2016 at 6:04 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-08-11 16:23, Dave T wrote:

>> 5. Would most of you guys use btrfs + dm-crypt on a production file
>> server (with spinning disks in JBOD configuration -- i.e., no RAID).
>> In this situation, the data is very important, of course. My past
>> experience indicated that RAID only improves uptime, which is not so
>> critical in our environment. Our main criteria is that we should never
>> ever have data loss. As far as I understand it, we do have to use
>> encryption.
>
> On a file server?  No, I'd ensure proper physical security is established
> and make sure it's properly secured against network based attacks and then
> not worry about it.  Unless you have things you want to hide from law
> enforcement or your government (which may or may not be legal where you
> live) or can reasonably expect someone to steal the system, you almost
> certainly don't actually need whole disk encryption.

Sure, but then you need a fairly strict handling policy for those
drives when they leave the environment: e.g. for an RMA if a drive
dies under warranty, or when a drive is being retired. First, there's
the actual physical handling (even interception) and accounting of all
of the drives, which has to be rather strict. And second, if a dead
drive can't be wiped, the fallback must be physical destruction. For any
data not worth physically destroying the drive over at disposal time,
you can probably forgo full disk encryption.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: checksum error in metadata node - best way to move root fs to new drive?
  2016-08-12 12:04 ` Austin S. Hemmelgarn
@ 2016-08-12 15:06   ` Duncan
  2016-08-15 11:33     ` Austin S. Hemmelgarn
  2016-08-12 17:02   ` Chris Murphy
  1 sibling, 1 reply; 28+ messages in thread
From: Duncan @ 2016-08-12 15:06 UTC (permalink / raw)
  To: linux-btrfs

Austin S. Hemmelgarn posted on Fri, 12 Aug 2016 08:04:42 -0400 as
excerpted:

> On a file server?  No, I'd ensure proper physical security is
> established and make sure it's properly secured against network based
> attacks and then not worry about it.  Unless you have things you want to
> hide from law enforcement or your government (which may or may not be
> legal where you live) or can reasonably expect someone to steal the
> system, you almost certainly don't actually need whole disk encryption.
> There are two specific exceptions to this though:
> 1. If your employer requires encryption on this system, that's their
> call.
> 2. Encrypted swap is a good thing regardless, because it prevents
> security credentials from accidentally being written unencrypted to
> persistent storage.

In the US, medical records are pretty well protected under penalty of law 
(HIPAA, IIRC?).  Anyone storing medical records here would do well to 
have full filesystem encryption for that reason.

Of course financial records are sensitive as well, or even just forum 
login information, and then there's the various industrial spies from 
various countries (China being the one most frequently named) that would 
pay good money for unencrypted devices from the right sources.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: checksum error in metadata node - best way to move root fs to new drive?
  2016-08-11 20:23 Dave T
  2016-08-12  4:13 ` Duncan
  2016-08-12  8:14 ` Adam Borowski
@ 2016-08-12 12:04 ` Austin S. Hemmelgarn
  2016-08-12 15:06   ` Duncan
  2016-08-12 17:02   ` Chris Murphy
  2 siblings, 2 replies; 28+ messages in thread
From: Austin S. Hemmelgarn @ 2016-08-12 12:04 UTC (permalink / raw)
  To: Dave T, Duncan; +Cc: Nicholas D Steeves, Chris Murphy, Btrfs BTRFS

On 2016-08-11 16:23, Dave T wrote:
> What I have gathered so far is the following:
>
> 1. my RAM is not faulty and I feel comfortable ruling out a memory
> error as having anything to do with the reported problem.
>
> 2. my storage device does not seem to be faulty. I have not figured
> out how to do more definitive testing, but smartctl reports it as
> healthy.
Is this just based on smartctl -H, or is it based on looking at all the 
info available from smartctl?  Based on everything you've said so far, 
it sounds to me like there was a group of uncorrectable errors on the 
disk, and the sectors in question have now been remapped by the device's 
firmware.  Such a situation is actually more common than people think 
(this is part of the whole 'reinstall to speed up your system' mentality 
in the Windows world).  I've actually had this happen before (and 
correlated the occurrences with spikes in readings from the data-logging 
Geiger counter I have next to my home server).  Most disks don't start 
to report as failing until they get into pretty bad condition: on most 
hard drives it takes a pretty insanely large count of reallocated 
sectors to mark the disk as failed in the drive firmware, and on SSD's 
you pretty much have to run them out of spare blocks, which takes a 
_long_ time on many SSD's.
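If you haven't already, it's worth pulling the full attribute and 
error-log output rather than just the overall verdict.  Roughly (device 
names are placeholders, and the exact attribute names vary by vendor):

    # overall verdict only -- this tends to stay PASSED far too long
    smartctl -H /dev/sda
    # everything: attributes, error log, self-test log
    smartctl -x /dev/sda
    # attributes worth watching: Reallocated_Sector_Ct,
    # Current_Pending_Sector, Offline_Uncorrectable, Reported_Uncorrect
    # for an NVMe device, the health/error logs come from:
    smartctl -a /dev/nvme0
    # and a long self-test can flush out marginal areas:
    smartctl -t long /dev/sda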
>
> 3. this problem first happened on a normally running system in light
> use. It had not recently crashed. But the root fs went read-only for
> an unknown reason.
>
> 4. the aftermath of the initial problem may have been exacerbated by
> hard resetting the system, but that's only a guess
>
>> The compression-related problem is this:  Btrfs is considerably less tolerant of checksum-related errors on btrfs-compressed data
>
> I'm an unsophisticated user. The argument in support of this statement
> sounds convincing to me. Therefore, I think I should discontinue using
> compression. Anyone disagree?
>
> Is there anything else I should change? (Do I need to provide
> additional information?)
>
> What can I do to find out more about what caused the initial problem.
> I have heard memory errors mentioned, but that's apparently not the
> case here. I have heard crash recovery mentioned, but that isn't how
> my problem initially happened.
>
> I also have a few general questions:
>
> 1. Can one discontinue using the compress mount option if it has been
> used previously? What happens to existing data if the compress mount
> option is 1) added when it wasn't used before, or 2) dropped when it
> had been used.
Yes, it just affects newly written data.  If you want to convert 
existing data to be uncompressed, you'll need to run 'btrfs filesystem 
defrag -r' against the filesystem to rewrite things.
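A minimal sketch of that conversion, assuming the filesystem is mounted 
at /mnt and compress has already been dropped from the mount options:

    mount -o remount,compress=no /mnt
    # rewrite everything; with compression off, the rewritten extents
    # should land uncompressed (files explicitly flagged with chattr +c
    # are the exception)
    btrfs filesystem defrag -r -v /mnt

Be aware that defrag also unshares reflinked/snapshotted extents, so on 
a heavily snapshotted filesystem this can cost a fair amount of space.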
>
> 2. I understand that the compress option generally improves btrfs
> performance (via Phoronix article I read in the past; I don't find the
> link). Since encryption has some characteristics in common with
> compression, would one expect any decrease in performance from
> dropping compression when using btrfs on dm-crypt? (For more context,
> with an i7 6700K which has aes-ni, CPU performance should not be a
> bottleneck on my computer.)
I would expect a change in performance in that case, but not necessarily 
a decrease.  The biggest advantage of compression is that it trades time 
spent using the disk for time spent using the CPU.  In many cases, this 
is a favorable trade-off when your storage is slower than your memory 
(because memory speed is really the big limiting factor here, not 
processor speed).  In your case, the encryption is hardware accelerated, 
but the compression isn't, so you should in theory actually get better 
performance by turning off compression.
>
> 3. How do I find out if it is appropriate to use dup metadata on a
> Samsung 950 Pro NVMe drive? I don't see deduplication mentioned in the
> drive's datasheet:
> http://www.samsung.com/semiconductor/minisite/ssd/downloads/document/Samsung_SSD_950_PRO_Data_Sheet_Rev_1_2.pdf
Whether or not it does deduplication is hard to answer.  If it does, 
then you obviously should avoid dup metadata.  If it doesn't, then it's 
a complex question as to whether or not to use dup metadata.  The short 
explanation for why is that the SSD firmware maintains a somewhat 
arbitrary mapping between LBA's and actual location of the data in 
flash, and it tends to group writes from around the same time together 
in the flash itself.  The argument against dup on SSD's in general takes 
this into account, arguing that because the data is likely to be in the 
same erase block for both copies, it's not as well protected. 
Personally, I run dup on non-deduplicating SSD's anyway, because I 
don't trust higher layers to not potentially mess up one of the copies, 
and I still get better performance than most hard disks.
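For reference, the dup choice itself is just a profile option at mkfs 
time or via balance -- a sketch, with the device and mountpoint names 
being placeholders for whatever your dm-crypt mapping is called:

    # at creation time
    mkfs.btrfs -m dup -d single /dev/mapper/cryptroot
    # or convert the metadata profile on an existing, mounted filesystem
    btrfs balance start -mconvert=dup /mnt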
>
> 4. Given that my drive is not reporting problems, does it seem
> reasonable to re-use this drive after the errors I reported? If so,
> how should I do that? Can I simply make a new btrfs filesystem and
> copy my data back? Should I start at a lower level and re-do the
> dm-crypt layer?
If it were me, I'd rebuild from the ground up just to be sure that 
everything is in a known working state.  That way you can be reasonably 
sure any issues are not left over from the previous configuration.
>
> 5. Would most of you guys use btrfs + dm-crypt on a production file
> server (with spinning disks in JBOD configuration -- i.e., no RAID).
> In this situation, the data is very important, of course. My past
> experience indicated that RAID only improves uptime, which is not so
> critical in our environment. Our main criteria is that we should never
> ever have data loss. As far as I understand it, we do have to use
> encryption.
On a file server?  No, I'd ensure proper physical security is 
established and make sure it's properly secured against network based 
attacks and then not worry about it.  Unless you have things you want to 
hide from law enforcement or your government (which may or may not be 
legal where you live) or can reasonably expect someone to steal the 
system, you almost certainly don't actually need whole disk encryption. 
There are two specific exceptions to this though:
1. If your employer requires encryption on this system, that's their call.
2. Encrypted swap is a good thing regardless, because it prevents 
security credentials from accidentally being written unencrypted to 
persistent storage.
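
For the encrypted swap case specifically, a throwaway random key is all 
that's needed, since swap doesn't have to survive a reboot.  A sketch of 
the usual crypttab/fstab pairing (the partition is a placeholder, and 
option names can differ slightly between distros):

    # /etc/crypttab
    cryptswap  /dev/sda2  /dev/urandom  swap,cipher=aes-xts-plain64,size=256
    # /etc/fstab
    /dev/mapper/cryptswap  none  swap  sw  0 0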

On my personal systems, I only use encryption for swap space and 
security credentials, but I use file based encryption for the 
credentials.  I also don't store any data that needs absolute protection 
against people stealing it though (other than the security credentials, 
but I can remotely deauthorize any of those with minimal effort), so 
there's not much advantage for me as a user to using disk encryption.

Things are pretty similar at work, except the reasoning there is that we 
have good network protection, and restricted access to the server room, 
so there's no way realistically without causing significant amounts of 
damage elsewhere that the data could be stolen (although we're in a 
small enough industry that the only people likely to want to steal our 
data is our competitors, and they don't have the funding to pull off 
industrial espionage).

Now, as far as RAID, I don't entirely agree about it just improving 
up-time.  That's one of the big advantages, but it's not the only one. 
Having a system that will survive a disk failure and keep working is 
good for other reasons too:
1. It makes it less immediately critical that things be dealt with (for 
example, if a disk fails in the middle of the night, you can often wait 
until the next morning to deal with it).
2. When done right with a system that supports hot-swap properly (all 
server systems these days should), it allows for much simpler and much 
safer storage device upgrades.
3. It makes it easier (when done with BTRFS or LVM) to re-provision 
storage space without having to take the system off-line.
I could have almost any of the Linux servers at work back up and running 
correctly from a backup in about 15 minutes, but I still have them set 
up with RAID-1 because it lets me do things like install bigger storage 
devices with minimal chance of data loss.  As for my personal systems, 
my home server is set up with RAID in such a way that I can lose 3 of 
the 4 hard drives and 1 of the 2 SSD's and still not need to restore 
from backup (and still have a working system).

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: checksum error in metadata node - best way to move root fs to new drive?
  2016-08-11 20:23 Dave T
  2016-08-12  4:13 ` Duncan
@ 2016-08-12  8:14 ` Adam Borowski
  2016-08-12 12:04 ` Austin S. Hemmelgarn
  2 siblings, 0 replies; 28+ messages in thread
From: Adam Borowski @ 2016-08-12  8:14 UTC (permalink / raw)
  To: Dave T; +Cc: Btrfs BTRFS

On Thu, Aug 11, 2016 at 04:23:45PM -0400, Dave T wrote:
> 1. Can one discontinue using the compress mount option if it has been
> used previously?

The mount option applies only to newly written blocks, and even then only to
files that don't say otherwise (via chattr +c or +C, btrfs property, etc).
You can change it on the fly (mount -o remount,...), etc.
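
Per-file control looks roughly like this (a sketch; /mnt/somefile is 
just a placeholder path):

    chattr +c /mnt/somefile                     # request compression for future writes
    btrfs property set /mnt/somefile compression lzo   # or pick a specific algorithm
    btrfs property get /mnt/somefile compression

Like the mount option, these only affect data written afterwards.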

> What happens to existing data if the compress mount option is 1) added
> when it wasn't used before, or 2) dropped when it had been used.

That data stays compressed or uncompressed, as when it was written.  You can
defrag them to change that; balance moves extents without changing their
compression.

> 2. I understand that the compress option generally improves btrfs
> performance (via Phoronix article I read in the past; I don't find the
> link). Since encryption has some characteristics in common with
> compression, would one expect any decrease in performance from
> dropping compression when using btrfs on dm-crypt? (For more context,
> with an i7 6700K which has aes-ni, CPU performance should not be a
> bottleneck on my computer.)

As said elsewhere, compression can drastically help or hurt performance;
it depends on your CPU-to-IO ratio, and on whether you do small random
writes inside files (compress has to rewrite a whole 128KB block).

An extreme data point: on an Odroid-U2 on eMMC doing Debian archive
rebuilds, compression improves overall throughput by a factor of around
two!  On the other hand, this same task on typical machines tends to be
CPU bound.

-- 
An imaginary friend squared is a real enemy.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: checksum error in metadata node - best way to move root fs to new drive?
  2016-08-11 20:23 Dave T
@ 2016-08-12  4:13 ` Duncan
  2016-08-12  8:14 ` Adam Borowski
  2016-08-12 12:04 ` Austin S. Hemmelgarn
  2 siblings, 0 replies; 28+ messages in thread
From: Duncan @ 2016-08-12  4:13 UTC (permalink / raw)
  To: linux-btrfs

Dave T posted on Thu, 11 Aug 2016 16:23:45 -0400 as excerpted:

> I also have a few general questions:
> 
> 1. Can one discontinue using the compress mount option if it has been
> used previously? What happens to existing data if the compress mount
> option is 1) added when it wasn't used before, or 2) dropped when it had
> been used.

The compress mount option only affects newly written data.  Data that was 
previously written is automatically decompressed into memory on read, 
regardless of whether the compress option is still being used or not.

So you can freely switch between using the option and not, and it'll only 
affect newly written files.  Existing files stay written the way they 
are, unless you do something (like run a recursive defrag with the 
compress option) to rewrite them.

> 2. I understand that the compress option generally improves btrfs
> performance (via Phoronix article I read in the past; I don't find the
> link). Since encryption has some characteristics in common with
> compression, would one expect any decrease in performance from dropping
> compression when using btrfs on dm-crypt? (For more context,
> with an i7 6700K which has aes-ni, CPU performance should not be a
> bottleneck on my computer.)

Compression performance works like this (this is a general rule, not 
btrfs specific):  Compression uses more CPU cycles but results in less 
data to actually transfer to and from storage.  If your disks are slow 
and your CPU is fast (or if the CPU can use hardware accelerated 
compression functions), performance will tend to favor compression, 
because the bottleneck will be the actual data transfer to and from 
storage and the extra overhead of the CPU cycles won't normally matter 
while the effect of less data to actually transfer, due to the 
compression, will.

But the slower the CPU (and lack of hardware accelerated compression 
functions) is and the faster storage IO is, the less of a bottleneck the 
actual data transfer will be, and thus the more likely it will be that 
the CPU will become the bottleneck, particularly as the compression gets 
more efficient size-wise, which generally translates to requiring more 
CPU cycles and/or memory to handle it.

Since your storage is PCIe 3.0 at over 1 GiB/sec, extremely fast, even 
tho LZO compression is considered fast (as opposed to size-efficient) as 
well, you may actually see /better/ performance without compression, 
especially when running CPU-heavy workloads where the extra CPU cycles 
of compression will matter, as the CPU is already the bottleneck.

Since you're doing encryption also, and that too tends to be CPU 
intensive (even if it's hardware accelerated for you), I'd actually be a 
bit surprised if you didn't see an increase of performance without 
compression, because your storage /is/ so incredibly fast compared to 
conventional storage.

But of course if it's really a concern, there's nothing like actually 
benchmarking it yourself to see. =:^)  I'd be very surprised if you 
actually notice a slowdown from turning compression off -- you might not 
notice a performance boost either -- tho some artificial benchmarks 
might show one if they aren't balancing CPU and IO in something like 
real-world proportions.
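
If you do benchmark, something along these lines at least compares like 
with like (a sketch only -- /mnt/test is a placeholder btrfs mount, and 
note that fio writes fairly incompressible data unless told otherwise, 
which would make the compressed run look artificially bad):

    mount -o remount,compress=lzo /mnt/test
    fio --name=cmp --directory=/mnt/test --rw=write --bs=1M --size=2G \
        --refill_buffers --buffer_compress_percentage=50 --end_fsync=1
    mount -o remount,compress=no /mnt/test
    fio --name=nocmp --directory=/mnt/test --rw=write --bs=1M --size=2G \
        --refill_buffers --buffer_compress_percentage=50 --end_fsync=1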

> 3. How do I find out if it is appropriate to use dup metadata on a
> Samsung 950 Pro NVMe drive? I don't see deduplication mentioned in the
> drive's datasheet:
> http://www.samsung.com/semiconductor/minisite/ssd/downloads/document/
Samsung_SSD_950_PRO_Data_Sheet_Rev_1_2.pdf

I'd google the controller.  A lot of them will list compression and 
dedup as features, since those enhance performance in some cases, while 
others list the stability of constant performance as a feature, as mine, 
targeted at the server market, did.  If the emphasis is on constant 
performance and what-you-see-is-what-you-get storage capacity, then 
they're not doing compression and dedup, as those can increase 
performance and storage capacity under certain conditions, but very 
unpredictably, since it depends on how much duplication the data has and 
how compressible it is.

Sandforce controllers, in particular, are known to emphasize compression 
and dedup.  OTOH, controllers targeted at enterprise or servers are 
likely to emphasize stability and predictability and thus not do 
transparent compression or dedup.

> 4. Given that my drive is not reporting problems, does it seem
> reasonable to re-use this drive after the errors I reported? If so,
> how should I do that? Can I simply make a new btrfs filesystem and copy
> my data back? Should I start at a lower level and re-do the dm-crypt
> layer?

I'd reuse it here.  For hardware that supports/needs trim I'd start at 
the bottom layer and work up, but IIRC you said yours doesn't need it.  
By the time you get to the btrfs layer on top of the crypt layer, the 
hardware layer should be seeing scrambled zeros and ones in any case, so 
if it's true your hardware doesn't need trim, I'd guess you should be 
fine just doing the mkfs on top of the existing dmcrypt layer.
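
So the two options would look roughly like this (device names are 
placeholders, and either way whatever is on the filesystem is gone, so 
back up first):

    # reuse the existing crypt mapping, just remake the filesystem
    mkfs.btrfs -f -L root /dev/mapper/cryptroot

    # or redo the crypt layer from scratch first
    cryptsetup luksFormat /dev/nvme0n1p2
    cryptsetup open /dev/nvme0n1p2 cryptroot
    mkfs.btrfs -L root /dev/mapper/cryptroot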

But I don't use a crypted layer here, so better to rely on others with 
experience with it, if you have their answers to rely on.

> 5. Would most of you guys use btrfs + dm-crypt on a production file
> server (with spinning disks in JBOD configuration -- i.e., no RAID).
> In this situation, the data is very important, of course. My past
> experience indicated that RAID only improves uptime, which is not so
> critical in our environment. Our main criteria is that we should never
> ever have data loss. As far as I understand it, we do have to use
> encryption.

I'd suggest, if the data is that important, do btrfs raid1.  Because 
unlike most raid, btrfs raid takes advantage of btrfs checksumming, and 
actually gives you a second copy to fall back on as well as to repair a 
bad copy, if the first copy tried fails the checksum test.  That level of 
run-time-verified data integrity and repair is something most raid 
systems don't have -- they'll only use the parity or redundancy to verify 
integrity if a device fails or if a scrub is done (and even with a scrub, 
in most cases at least for redundant-raid they simply blindly copy the 
one device to the others, no real integrity checking at all).  But 
because btrfs raid1 actually does that real-time integrity checking and 
repair, it's a lot stronger in use-cases where data integrity is 
paramount.

Tho do note that btrfs raid1 is ONLY two-copy; additional devices 
increase capacity, not redundancy.  So I'd create two crypted devices of 
roughly the same size out of your JBOD, and expose them to btrfs to use 
as a raid1.

Or if you want a cold-spare, create three crypted devices of about the 
same size, create a btrfs raid1 out of two of them, and keep the third in 
reserve to btrfs replace, if needed.
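
In rough command form, assuming the crypted devices come up as 
/dev/mapper/c0, /dev/mapper/c1 and (for the cold spare) /dev/mapper/c2, 
and with the mountpoint being a placeholder:

    mkfs.btrfs -d raid1 -m raid1 /dev/mapper/c0 /dev/mapper/c1
    mount /dev/mapper/c0 /srv/data
    # if a member starts failing, swap in the reserve while still mounted
    btrfs replace start /dev/mapper/c0 /dev/mapper/c2 /srv/data
    btrfs replace status /srv/data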

Tho as I said earlier, I don't personally trust btrfs on the crypted 
layer yet, so for me, I'd either use something other than btrfs, or use 
btrfs but really emphasize the backups, including testing them of course, 
because I /don't/ really trust btrfs on crypted just yet.  But based on 
earlier posts in this thread, I admit it's very possible that all the 
reported cases that are the basis for my not trusting btrfs on dmcrypt 
yet were using btrfs compression, and it's possible /that/ was the real 
problem, and that without it, things will be fine.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: checksum error in metadata node - best way to move root fs to new drive?
@ 2016-08-11 20:23 Dave T
  2016-08-12  4:13 ` Duncan
                   ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Dave T @ 2016-08-11 20:23 UTC (permalink / raw)
  To: Duncan
  Cc: Nicholas D Steeves, Chris Murphy, Btrfs BTRFS, Austin S. Hemmelgarn

What I have gathered so far is the following:

1. my RAM is not faulty and I feel comfortable ruling out a memory
error as having anything to do with the reported problem.

2. my storage device does not seem to be faulty. I have not figured
out how to do more definitive testing, but smartctl reports it as
healthy.

3. this problem first happened on a normally running system in light
use. It had not recently crashed. But the root fs went read-only for
an unknown reason.

4. the aftermath of the initial problem may have been exacerbated by
hard resetting the system, but that's only a guess

> The compression-related problem is this:  Btrfs is considerably less tolerant of checksum-related errors on btrfs-compressed data

I'm an unsophisticated user. The argument in support of this statement
sounds convincing to me. Therefore, I think I should discontinue using
compression. Anyone disagree?

Is there anything else I should change? (Do I need to provide
additional information?)

What can I do to find out more about what caused the initial problem?
I have heard memory errors mentioned, but that's apparently not the
case here. I have heard crash recovery mentioned, but that isn't how
my problem initially happened.

I also have a few general questions:

1. Can one discontinue using the compress mount option if it has been
used previously? What happens to existing data if the compress mount
option is 1) added when it wasn't used before, or 2) dropped when it
had been used.

2. I understand that the compress option generally improves btrfs
performance (via Phoronix article I read in the past; I don't find the
link). Since encryption has some characteristics in common with
compression, would one expect any decrease in performance from
dropping compression when using btrfs on dm-crypt? (For more context,
with an i7 6700K which has aes-ni, CPU performance should not be a
bottleneck on my computer.)

3. How do I find out if it is appropriate to use dup metadata on a
Samsung 950 Pro NVMe drive? I don't see deduplication mentioned in the
drive's datasheet:
http://www.samsung.com/semiconductor/minisite/ssd/downloads/document/Samsung_SSD_950_PRO_Data_Sheet_Rev_1_2.pdf

4. Given that my drive is not reporting problems, does it seem
reasonable to re-use this drive after the errors I reported? If so,
how should I do that? Can I simply make a new btrfs filesystem and
copy my data back? Should I start at a lower level and re-do the
dm-crypt layer?

5. Would most of you guys use btrfs + dm-crypt on a production file
server (with spinning disks in JBOD configuration -- i.e., no RAID).
In this situation, the data is very important, of course. My past
experience indicated that RAID only improves uptime, which is not so
critical in our environment. Our main criteria is that we should never
ever have data loss. As far as I understand it, we do have to use
encryption.

Thanks for the discussion so far. It's very educational for me.

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2016-08-15 11:33 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-10  3:27 checksum error in metadata node - best way to move root fs to new drive? Dave T
2016-08-10  6:27 ` Duncan
2016-08-10 19:46   ` Austin S. Hemmelgarn
2016-08-10 21:21   ` Chris Murphy
2016-08-10 22:01     ` Dave T
2016-08-10 22:23       ` Chris Murphy
2016-08-10 22:52         ` Dave T
2016-08-11 14:12           ` Nicholas D Steeves
2016-08-11 14:45             ` Austin S. Hemmelgarn
2016-08-11 19:07             ` Duncan
2016-08-11 20:43               ` Chris Murphy
2016-08-12  3:11                 ` Duncan
2016-08-12  3:51                   ` Chris Murphy
2016-08-11 20:33             ` Chris Murphy
2016-08-11  7:18         ` Andrei Borzenkov
2016-08-11  4:50       ` Duncan
2016-08-11  5:06         ` Gareth Pye
2016-08-11  8:20           ` Duncan
2016-08-12 17:00     ` Patrik Lundquist
2016-08-10 21:15 ` Chris Murphy
2016-08-10 22:50   ` Dave T
2016-08-11 20:23 Dave T
2016-08-12  4:13 ` Duncan
2016-08-12  8:14 ` Adam Borowski
2016-08-12 12:04 ` Austin S. Hemmelgarn
2016-08-12 15:06   ` Duncan
2016-08-15 11:33     ` Austin S. Hemmelgarn
2016-08-12 17:02   ` Chris Murphy
