* checksum error in metadata node - best way to move root fs to new drive?
@ 2016-08-10 3:27 Dave T
2016-08-10 6:27 ` Duncan
2016-08-10 21:15 ` Chris Murphy
0 siblings, 2 replies; 28+ messages in thread
From: Dave T @ 2016-08-10 3:27 UTC (permalink / raw)
To: linux-btrfs
btrfs scrub returned with uncorrectable errors. Searching in dmesg
returns the following information:
BTRFS warning (device dm-0): checksum error at logical NNNNN on
/dev/mapper/[crypto] sector: yyyyy metadata node (level 2) in tree 250
it also says:
unable to fixup (regular) error at logical NNNNNN on /dev/mapper/[crypto]
I assume I have a bad block device. Does that seem correct? The
important data is backed up.
However, it would save me a lot of time reinstalling the operating
system and setting up my work environment if I can copy this root
filesystem to another storage device.
Can I do that, considering the errors I have mentioned? With the
uncorrectable error being in a metadata node, what (if anything) does
that imply about restoring from this drive?
If I can copy this entire root filesystem, what is the best way to do
it? The btrfs restore tool? cp? rsync? Some cloning tool? Other
options?
If I use the btrfs restore tool, should I use options x, m and S? In
particular I wonder exactly what the S option does. If I leave S out,
are all symlinks ignored?
I'm trying to save time and clone this so that I get the operating
system and all my tweaks / configurations back. As I said, the really
important data is separately backed up.
I appreciate all suggestions.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: checksum error in metadata node - best way to move root fs to new drive?
2016-08-10 3:27 checksum error in metadata node - best way to move root fs to new drive? Dave T
@ 2016-08-10 6:27 ` Duncan
2016-08-10 19:46 ` Austin S. Hemmelgarn
2016-08-10 21:21 ` Chris Murphy
2016-08-10 21:15 ` Chris Murphy
1 sibling, 2 replies; 28+ messages in thread
From: Duncan @ 2016-08-10 6:27 UTC (permalink / raw)
To: linux-btrfs
Dave T posted on Tue, 09 Aug 2016 23:27:56 -0400 as excerpted:
> btrfs scrub returned with uncorrectable errors. Searching in dmesg
> returns the following information:
>
> BTRFS warning (device dm-0): checksum error at logical NNNNN on
> /dev/mapper/[crypto] sector: yyyyy metadata node (level 2) in tree 250
>
> it also says:
>
> unable to fixup (regular) error at logical NNNNNN on
> /dev/mapper/[crypto]
>
>
> I assume I have a bad block device. Does that seem correct? The
> important data is backed up.
>
> However, it would save me a lot of time reinstalling the operating
> system and setting up my work environment if I can copy this root
> filesystem to another storage device.
>
> Can I do that, considering the errors I have mentioned? With the
> uncorrectable error being in a metadata node, what (if anything) does
> that imply about restoring from this drive?
Well, given that I don't see anyone more qualified posting -- I'm a
simple btrfs user and list regular, tho not a dmcrypt user and definitely
not a btrfs dev -- I'll try to help, but...
Do you know what data and metadata replication modes you were using?
Scrub detects checksum errors, and for raid1 mode on multi-device (but I
guess you were single device) and dup mode on single device, it will try
the other copy and use it if the checksum passes there, repairing the bad
copy as well.
But until recently dup mode data on single device was impossible, so I
doubt you were using that, and while dup mode metadata was the normal
default, on ssd that changes to single mode as well.
Which means if you were using ssd defaults, you got single mode for both
data and metadata, and scrub can detect but not correct checksum errors.
That doesn't directly answer your question, but it does explain why/that
you couldn't /expect/ scrub to fix checksum problems, only detect them,
if both data and metadata are single mode.
Meanwhile, in a different post you asked about btrfs on dmcrypt. I'm not
aware of any direct btrfs-on-dmcrypt specific bugs (tho I'm just a btrfs
user and list regular, not a dev, so could have missed something), but
certainly, the dmcrypt layer doesn't simplify things. There was a guy
here, Marc MERLIN, who worked for Google I believe and was on the road
frequently, that was using btrfs on dmcrypt for his laptop and various
btrfs on his servers as well -- he wrote some of the raid56 mode stuff on
the wiki based on his own experiments with it. But I haven't seen him
around recently. I'd suggest he'd be the guy to talk to about btrfs on
dmcrypt if you can get in contact with him, as he seemed to have more
experience with it than anyone else around here. But like I said I
haven't seen him around recently...
Put it this way. If it were my data on the line, I'd either (1) use
another filesystem on top of dmcrypt, if I really wanted/needed the
crypted layer, or (2) do without the crypted layer, or (3) use btrfs but
be extra vigilant with backups. This is because, while I know of no specific
bugs in the btrfs-on-dmcrypt case, I don't particularly trust it either, and
Marc MERLIN's posted troubles with the combo were enough to have me
avoiding it if possible, and being extra careful with backups if not.
> If I can copy this entire root filesystem, what is the best way to do
> it? The btrfs restore tool? cp? rsync? Some cloning tool? Other options?
It depends on if the filesystem is mountable and if so, how much can be
retrieved without error, the latter of which depends on the extent of
that metadata damage, since damaged metadata will likely take out
multiple files, and depending on what level of the tree the damage was
on, it could take out only a few files, or most of the filesystem!
If you can mount and the damage appears to be limited, I'd try mounting
read-only and copying what I could off, using conventional methods. That
way you get checksum protection, which should help assure that anything
successfully copied isn't corrupted: btrfs will error out on a checksum
failure, so the copy of that file won't complete.
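Concretely, the mount-read-only-and-copy route might look like the
sketch below. The device and mount point names (/dev/mapper/cryptroot,
/mnt/rescue) are examples, not taken from this thread; the script skips
itself if the example device doesn't exist.

```shell
#!/bin/sh
set -u
src=/dev/mapper/cryptroot    # example name for the damaged filesystem
dst=/mnt/rescue              # example destination, assumed already mounted

if [ -b "$src" ]; then
    mkdir -p /mnt/damaged
    # ro mount: nothing gets written to the failing filesystem
    mount -o ro "$src" /mnt/damaged
    # -a preserves ownership/perms/times, -H hardlinks, -A ACLs, -X xattrs
    rsync -aHAX /mnt/damaged/ "$dst"/
    umount /mnt/damaged
else
    echo "block device $src not present; edit src/dst for your system"
fi
```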
If it won't mount or it will but the damage appears to be extensive, I'd
suggest using restore. It's read-only in terms of the filesystem it's
restoring from, so shouldn't cause further damage -- unless the device is
actively decaying as you use it, in which case the first thing I'd try to
do is image it to something else so the damage isn't getting worse as you
work with it.
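If the device does look like it's actively decaying, imaging it first
could be done with GNU ddrescue, roughly as below. The device and
destination paths are examples, and the sketch assumes the ddrescue
package is installed; it skips itself if the example device is absent.

```shell
#!/bin/sh
set -u
src=/dev/sdX                 # example: the failing device (or dm-crypt mapping)
img=/mnt/backup/disk.img     # example destination image on healthy storage
map=/mnt/backup/disk.map     # ddrescue map file, allows resuming the rescue

if [ -b "$src" ]; then
    # The first pass grabs everything that reads cleanly; -r3 then
    # retries the bad areas up to three times.
    ddrescue -r3 "$src" "$img" "$map"
else
    echo "block device $src not present; edit the paths for your system"
fi
```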
But AFAIK restore doesn't give you the checksum protection, so anything
restored that way /could/ be corrupt. (Tho it's worth noting that ordinary
filesystems don't do checksum protection anyway, so a file retrieved this
way isn't any more suspect than one copied from, say, an ext4 filesystem
with no other way to verify it.)
Altho... working on dmcrypt, I suppose it's likely that anything that's
corrupted turns up entirely scrambled and useless anyway -- instead of
retrieving, say, a video file with a few dropouts as you might from
unencrypted storage, you may get a totally scrambled and useless file,
or at least a scrambled 4K block of it.
> If I use the btrfs restore tool, should I use options x, m and S? In
> particular I wonder exactly what the S option does. If I leave S out,
> are all symlinks ignored?
Symlinks are not restored without -S, correct. That and -m are both
relatively new restore options -- back when I first used restore you
simply didn't get that back.
If it's primarily just data files and you don't really care about
ownership/permissions or date metadata, you can leave the -m off to
simplify the process slightly. In that case, the files will be written
just as any other new file would be written, as the user (root) the app
is running as, subject to the current umask. Else use the -m and restore
will try to restore ownership/permissions/dates metadata as well.
Similarly, you may or may not need -x for the extended attributes.
Unless you're using selinux and its security attributes, or capabilities to
avoid running as superuser (and those both apply primarily to
executables), chances are fairly good that unless you specifically know
you need extended attributes restored, you don't, and can skip that
option.
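Putting those flags together, a full-fidelity restore invocation would
look roughly like the sketch below. The device and target paths are
examples; drop -x, -m, or -S per the discussion above. The script skips
itself if the example device doesn't exist.

```shell
#!/bin/sh
set -u
dev=/dev/mapper/cryptroot    # example: the damaged btrfs device
out=/mnt/rescue              # example: destination directory on healthy storage

if [ -b "$dev" ] && [ -d "$out" ]; then
    # -x extended attributes, -m ownership/permissions/dates, -S symlinks
    btrfs restore -x -m -S "$dev" "$out"
else
    echo "device or target missing; edit dev/out for your system"
fi
```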
> I'm trying to save time and clone this so that I get the operating
> system and all my tweaks / configurations back. As I said, the really
> important data is separately backed up.
Good. =:^)
Sounds about like me. I do periodic backups, but have run restore a
couple times when a filesystem wouldn't mount, in order to get back as
much of the delta between the last backup and current as possible. Of
course I know not doing more frequent backups is a calculated risk and I
was prepared to have to redo anything changed since the backup if
necessary, but it's nice to have a tool like btrfs restore that can make
it unnecessary under certain conditions where it otherwise would be. =:^)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: checksum error in metadata node - best way to move root fs to new drive?
2016-08-10 6:27 ` Duncan
@ 2016-08-10 19:46 ` Austin S. Hemmelgarn
2016-08-10 21:21 ` Chris Murphy
1 sibling, 0 replies; 28+ messages in thread
From: Austin S. Hemmelgarn @ 2016-08-10 19:46 UTC (permalink / raw)
To: linux-btrfs
On 2016-08-10 02:27, Duncan wrote:
> Dave T posted on Tue, 09 Aug 2016 23:27:56 -0400 as excerpted:
>
>> btrfs scrub returned with uncorrectable errors. Searching in dmesg
>> returns the following information:
>>
>> BTRFS warning (device dm-0): checksum error at logical NNNNN on
>> /dev/mapper/[crypto] sector: yyyyy metadata node (level 2) in tree 250
>>
>> it also says:
>>
>> unable to fixup (regular) error at logical NNNNNN on
>> /dev/mapper/[crypto]
>>
>>
>> I assume I have a bad block device. Does that seem correct? The
>> important data is backed up.
>>
>> However, it would save me a lot of time reinstalling the operating
>> system and setting up my work environment if I can copy this root
>> filesystem to another storage device.
>>
> Can I do that, considering the errors I have mentioned? With the
>> uncorrectable error being in a metadata node, what (if anything) does
>> that imply about restoring from this drive?
>
> Well, given that I don't see any other people more qualified than I, as a
> simple btrfs user and list regular, tho not a dmcrypt user and definitely
> not a btrfs dev, posting, I'll try to help, but...
I probably would have replied, if I had seen the e-mail before now.
GMail apparently really hates me recently, as I keep getting things
hours to days after other people and regularly out of order...
As usual though, you seem to have already covered everything important
pretty well, I've only got a few comments to add below.
>
> Do you know what data and metadata replication modes you were using?
> Scrub detects checksum errors, and for raid1 mode on multi-device (but I
> guess you were single device) and dup mode on single device, it will try
> the other copy and use it if the checksum passes there, repairing the bad
> copy as well.
>
> But until recently dup mode data on single device was impossible, so I
> doubt you were using that, and while dup mode metadata was the normal
> default, on ssd that changes to single mode as well.
>
> Which means if you were using ssd defaults, you got single mode for both
> data and metadata, and scrub can detect but not correct checksum errors.
>
> That doesn't directly answer your question, but it does explain why/that
> you couldn't /expect/ scrub to fix checksum problems, only detect them,
> if both data and metadata are single mode.
>
> Meanwhile, in a different post you asked about btrfs on dmcrypt. I'm not
> aware of any direct btrfs-on-dmcrypt specific bugs (tho I'm just a btrfs
> user and list regular, not a dev, so could have missed something), but
> certainly, the dmcrypt layer doesn't simplify things. There was a guy
> here, Marc MERLIN, worked for google I believe and was on the road
> frequently, that was using btrfs on dmcrypt for his laptop and various
> btrfs on his servers as well -- he wrote some of the raid56 mode stuff on
> the wiki based on his own experiments with it. But I haven't seen him
> around recently. I'd suggest he'd be the guy to talk to about btrfs on
> dmcrypt if you can get in contact with him, as he seemed to have more
> experience with it than anyone else around here. But like I said I
> haven't seen him around recently...
>
> Put it this way. If it were my data on the line, I'd either (1) use
> another filesystem on top of dmcrypt, if I really wanted/needed the
> crypted layer, or (2) do without the crypted layer, or (3) use btrfs but
> be extra vigilant with backups. This since while I know of no specific
> bugs in btrfs-on-dmcrypt case, I don't particularly trust it either, and
> Marc MERLIN's posted troubles with the combo were enough to have me
> avoiding it if possible, and being extra careful with backups if not.
As far as dm-crypt goes, it looks like BTRFS is stable on top in the
configuration I use (aes-xts-plain64 with a long key using plain
dm-crypt instead of LUKS). I have heard rumors of issues when using
LUKS without hardware acceleration, but I've never seen any conclusive
proof, and what little I've heard sounds more like it was just race
conditions elsewhere causing the issues.
>
>> If I can copy this entire root filesystem, what is the best way to do
>> it? The btrfs restore tool? cp? rsync? Some cloning tool? Other options?
>
> It depends on if the filesystem is mountable and if so, how much can be
> retrieved without error, the latter of which depends on the extent of
> that metadata damage, since damaged metadata will likely take out
> multiple files, and depending on what level of the tree the damage was
> on, it could take out only a few files, or most of the filesystem!
>
> If you can mount and the damage appears to be limited, I'd try mounting
> read-only and copying what I could off, using conventional methods. That
> way you get checksum protection, which should help assure that anything
> successfully copied isn't corrupted, because btrfs will error out if
> there's checksum errors and it won't copy successfully.
>
> If it won't mount or it will but the damage appears to be extensive, I'd
> suggest using restore. It's read-only in terms of the filesystem it's
> restoring from, so shouldn't cause further damage -- unless the device is
> actively decaying as you use it, in which case the first thing I'd try to
> do is image it to something else so the damage isn't getting worse as you
> work with it.
>
> But AFAIK restore doesn't give you the checksum protection, so anything
> restored that way /could/ be corrupt (tho it's worth noting that ordinary
> filesystems don't do checksum protection anyway, so it's important not to
> consider the file any more damaged just because it wasn't checksum
> protected than it would be if you simply retrieved it from say an ext4
> filesystem and didn't have some other method to verify the file).
>
> Altho... working on dmcrypt, I suppose it's likely that anything that's
> corrupted turns up entirely scrambled and useless anyway -- you may not
> be able to retrieve for example a video file with some dropouts as may be
> the case on unencrypted storage, but have a totally scrambled and useless
> file, or at least that file block (4K), instead.
This may or may not be the case, it really depends on how dm-crypt is
set up, and a bunch of other factors. The chance of this happening is
higher with dm-crypt, but it's still not a certainty.
>
>> If I use the btrfs restore tool, should I use options x, m and S? In
>> particular I wonder exactly what the S option does. If I leave S out,
>> are all symlinks ignored?
>
> Symlinks are not restored without -S, correct. That and -m are both
> relatively new restore options -- back when I first used restore you
> simply didn't get that back.
>
> If it's primarily just data files and you don't really care about
> ownership/permissions or date metadata, you can leave the -m off to
> simplify the process slightly. In that case, the files will be written
> just as any other new file would be written, as the user (root) the app
> is running as, subject to the current umask. Else use the -m and restore
> will try to restore ownership/permissions/dates metadata as well.
>
> Similarly, you may or may not need -x for the extended attributes.
> Unless you're using selinux and its security attributes, or capabilities to
> avoid running as superuser (and those both apply primarily to
> executables), chances are fairly good that unless you specifically know
> you need extended attributes restored, you don't, and can skip that
> option.
There are a few other cases where they are important, but most of them
are big data-center type things. The big one I can think of off the top
of my head is using GlusterFS on top of BTRFS, as Gluster stores
synchronization info in xattrs; I'm pretty certain Ceph does too. In
general though, if it's just a workstation, you probably don't need
xattrs unless you use a security module (like SELinux, IMA, or EVM),
file capabilities (ping almost certainly uses them on your system, but I
doubt anything else does, and ping won't break without them), or ACLs
(or Samba, which can store Windows-style ACEs in xattrs, but it doesn't
do so by default, and setting that up right is complicated).
If you can afford to wait a bit longer, it's probably better to use -x,
because most of the things that break in the face of missing xattrs tend
to break rather spectacularly.
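One way to see whether a tree you're about to restore actually carries
xattrs (and therefore whether -x buys you anything) is getfattr, from
the attr package. A sketch; the path is a placeholder, pass your own:

```shell
#!/bin/sh
# Recursively dump all extended attributes under a path. Empty output
# (beyond the final echo) means nothing there depends on xattrs; on
# most distros ping's security.capability attribute will show up.
path="${1:-/usr/bin}"        # placeholder path; pass your own as $1
getfattr -R -d -m - "$path" 2>/dev/null || true
echo "scanned $path"
```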
>
>> I'm trying to save time and clone this so that I get the operating
>> system and all my tweaks / configurations back. As I said, the really
>> important data is separately backed up.
>
> Good. =:^)
>
> Sounds about like me. I do periodic backups, but have run restore a
> couple times when a filesystem wouldn't mount, in order to get back as
> much of the delta between the last backup and current as possible. Of
> course I know not doing more frequent backups is a calculated risk and I
> was prepared to have to redo anything changed since the backup if
> necessary, but it's nice to have a tool like btrfs restore that can make
> it unnecessary under certain conditions where it otherwise would be. =:^)
>
* Re: checksum error in metadata node - best way to move root fs to new drive?
2016-08-10 3:27 checksum error in metadata node - best way to move root fs to new drive? Dave T
2016-08-10 6:27 ` Duncan
@ 2016-08-10 21:15 ` Chris Murphy
2016-08-10 22:50 ` Dave T
1 sibling, 1 reply; 28+ messages in thread
From: Chris Murphy @ 2016-08-10 21:15 UTC (permalink / raw)
To: Dave T; +Cc: Btrfs BTRFS
On Tue, Aug 9, 2016 at 9:27 PM, Dave T <davestechshop@gmail.com> wrote:
> btrfs scrub returned with uncorrectable errors. Searching in dmesg
> returns the following information:
>
> BTRFS warning (device dm-0): checksum error at logical NNNNN on
> /dev/mapper/[crypto] sector: yyyyy metadata node (level 2) in tree 250
>
> it also says:
>
> unable to fixup (regular) error at logical NNNNNN on /dev/mapper/[crypto]
>
>
> I assume I have a bad block device. Does that seem correct? The
> important data is backed up.
If it were persistently, blatantly bad, then the drive firmware would
know about it and would report a read error. If you're not seeing
libata UNC errors (or the other way they manifest: hard link resets due
to the inappropriate SCSI command timer default in the kernel), then
it's probably some kind of silent data corruption (SDC), a torn or
misdirected write, etc.
If metadata is profile DUP, then scrub should fix it. If it's not,
there's something else going on (or really bad luck).
I'd like to believe that btrfs check can, or someday will be able to,
do some kind of sanity check on a node that fails checksum, and fix it.
A node that can be read but merely fails checksum isn't really a good
reason for a file system to deny you access to its data, but yeah, it
kinda depends on what's in the node. It could contain up to a couple
hundred items, each of which points elsewhere.
btrfs-debug-tree -b <block number reported by error at logical> <dev>
might give some hint what's going on. I'd like to believe it'll be
noisy and warn the checksum fails but still show the contents assuming
the drive hands over the data on those sectors.
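Spelled out, that inspection would look like the sketch below, with a
placeholder block number standing in for the logical address from the
scrub error (not a real value from this thread):

```shell
#!/bin/sh
set -u
dev=/dev/mapper/cryptroot    # example device name
block=123456789              # placeholder: logical address from the scrub error

if [ -b "$dev" ]; then
    # Dump the tree node at that logical address; expect a checksum
    # warning, but the item contents may still print.
    btrfs-debug-tree -b "$block" "$dev"
else
    echo "device $dev not present; substitute your device and block number"
fi
```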
> If I can copy this entire root filesystem, what is the best way to do
> it? The btrfs restore tool? cp? rsync? Some cloning tool? Other
> options?
0. Backup, that's done.
1. Report 'btrfs check' without --repair, let's see what it complains
about and if it might be able to plausibly fix this.
Since you can scrub, it means the file system mounts. Since the file
system mounts, I would not look at restore to start out because it's
tedious. I'd say you toss a coin over using btrfs send/receive, or
btrfs check --repair to see if it fixes the node. These days it should
be safe with relatively recent btrfs-progs so I'd say use a 4.6.x or
4.7 progs for this. And then the send/receive should be done with -v
or maybe even -vv for both send and receive, along with --max-errors
0, which will permit unlimited errors but will report them rather than
failing midstream. This will get you the bulk of the OS.
If you're lucky, the node contains only a handful of relatively
unimportant items, especially if they're files small enough to be
stored inline the node, which will substantially reduce the number of
errors as a result of a single node loss.
The calculus on doing btrfs check --repair first and then send/receive,
versus send/receive first with btrfs check --repair as the fallback, is
mainly time. Maybe repair can fix it, maybe it makes things worse;
whereas send/receive might fail midstream without the node being fixed
first, but causes no additional problems. The second approach is more
conservative, but takes more time if the send/receive fails: you then do
the repair, and then have to start the send/receive over from scratch.
(If it fails, you should delete or rename the bad subvolume on the
receive side before starting another send.)
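A sketch of that send/receive route. The mount points and snapshot name
are examples; note that --max-errors is an option of btrfs receive, and
that send needs a read-only snapshot to operate on. The script skips
itself if the example mount points don't exist.

```shell
#!/bin/sh
set -u
src=/mnt/damaged-root        # example: mounted damaged filesystem
dst=/mnt/new-drive           # example: mounted destination filesystem

if [ -d "$src" ] && [ -d "$dst" ]; then
    # send only works from a read-only snapshot
    btrfs subvolume snapshot -r "$src" "$src/rescue-snap"
    # --max-errors 0: log every error but keep going instead of
    # aborting the stream partway through
    btrfs send -v "$src/rescue-snap" | btrfs receive -v --max-errors 0 "$dst"
else
    echo "mount points missing; edit src/dst for your system"
fi
```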
> If I use the btrfs restore tool, should I use options x, m and S? In
> particular I wonder exactly what the S option does. If I leave S out,
> are all symlinks ignored?
I would only use restore for the files that are reported by
send/receive as failed due to errors - assuming that even happens. Or
since this is OS stuff, just reinstall the packages for the files
affected by the bad node.
--
Chris Murphy
* Re: checksum error in metadata node - best way to move root fs to new drive?
2016-08-10 6:27 ` Duncan
2016-08-10 19:46 ` Austin S. Hemmelgarn
@ 2016-08-10 21:21 ` Chris Murphy
2016-08-10 22:01 ` Dave T
2016-08-12 17:00 ` Patrik Lundquist
1 sibling, 2 replies; 28+ messages in thread
From: Chris Murphy @ 2016-08-10 21:21 UTC (permalink / raw)
Cc: Btrfs BTRFS
I'm using LUKS, aes xts-plain64, on six devices. One is using mixed-bg
single device. One is dsingle mdup. And then 2x2 mraid1 draid1. I've
had zero problems. The two computers these run on do have aesni
support. Aging wise, they're all at least a year old. But I've been
using Btrfs on LUKS for much longer than that.
Chris Murphy
* Re: checksum error in metadata node - best way to move root fs to new drive?
2016-08-10 21:21 ` Chris Murphy
@ 2016-08-10 22:01 ` Dave T
2016-08-10 22:23 ` Chris Murphy
2016-08-11 4:50 ` Duncan
2016-08-12 17:00 ` Patrik Lundquist
1 sibling, 2 replies; 28+ messages in thread
From: Dave T @ 2016-08-10 22:01 UTC (permalink / raw)
To: Chris Murphy, Duncan, ahferroin7; +Cc: Btrfs BTRFS
Thanks for all the responses, guys! I really appreciate it. This
information is very helpful. I will be working through the suggestions
(e.g., check without repair) for the next hour or so. I'll report back
when I have something to report.
My drive is a Samsung 950 Pro NVMe drive, which in most respects is
treated like an SSD. (The only difference I am aware of is that trim
isn't needed.)
> But until recently dup mode data on single device was impossible, so I
> doubt you were using that, and while dup mode metadata was the normal
> default, on ssd that changes to single mode as well.
Your assumptions are correct: single mode for data and metadata.
Does anyone have any thoughts about using dup mode for metadata on a
Samsung 950 Pro (or any NVMe drive)?
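For reference, switching existing metadata to dup (should the
filesystem come out of this healthy) is a single balance operation.
A sketch; the mount point is an example, and the guard variable is only
there so the sketch doesn't run by accident:

```shell
#!/bin/sh
set -u
mnt=/                        # example: mountpoint of the btrfs filesystem

if [ "${RUN_BALANCE:-no}" = yes ]; then
    # Rewrite all metadata chunks with the dup profile. System chunks
    # need a separate -sconvert=dup (with -f) if you want those too.
    btrfs balance start -mconvert=dup "$mnt"
else
    echo "set RUN_BALANCE=yes (and check mnt) to actually convert $mnt"
fi
```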
I will be very disappointed if I cannot use btrfs + dm-crypt. As far
as I can see, there is no alternative given that I need to use
snapshots (and LVM, as good as it is, has severe performance penalties
for its snapshots). I'm required to use crypto. I cannot risk doing
without snapshots. Therefore, btrfs + dm-crypt seem like my only
viable solution. Plus it is my preferred solution. I like both tools.
If all goes well, we are planning to implement a production file
server for our office with dm-crypt + btrfs (and a lot of spinning
disks).
In the office we currently have another system identical to mine
running the same drive with dm-crypt + btrfs, the same operating
system, the same nvidia GPU and proprietary driver, and it is running
fine. One difference is that it is overclocked substantially (mine
isn't). I would have expected it would give a problem before mine
would. But it seems to be rock solid. I just ran btrfs scrub on it and
it finished in a few seconds with no errors.
On my computer I have run two extensive memory tests (8 cpu cores in
parallel, all tests). The current test has been running for 14 hrs
with no errors. (I think that 8 cores in parallel make this equivalent
to a much longer test with the default single cpu settings.)
Therefore, I do not believe this issue is caused by RAM.
I'm hoping there is no configuration error or other mistake I made in
setting these systems up that would lead to the problems I'm
experiencing.
BTW, I was able to copy all the files to another drive with no
problem. I used "cp -a" to copy, then I ran "rsync -a" twice to make
sure nothing was missed. My guess is that I'll be able to copy this
right back onto the root filesystem after I resolve whatever the
problem is and my operating system will be back to the same state it
was in prior to this problem.
OK, I'm off to try btrfs check without --repair... thanks again!
For reference:
btrfs-progs v4.6.1
Linux 4.6.4-1-ARCH #1 SMP PREEMPT Mon Jul 11 19:12:32 CEST 2016 x86_64 GNU/Linux
On Wed, Aug 10, 2016 at 5:21 PM, Chris Murphy <lists@colorremedies.com> wrote:
> I'm using LUKS, aes xts-plain64, on six devices. One is using mixed-bg
> single device. One is dsingle mdup. And then 2x2 mraid1 draid1. I've
> had zero problems. The two computers these run on do have aesni
> support. Aging wise, they're all at least a year old. But I've been
> using Btrfs on LUKS for much longer than that.
>
>
> Chris Murphy
* Re: checksum error in metadata node - best way to move root fs to new drive?
2016-08-10 22:01 ` Dave T
@ 2016-08-10 22:23 ` Chris Murphy
2016-08-10 22:52 ` Dave T
2016-08-11 7:18 ` Andrei Borzenkov
2016-08-11 4:50 ` Duncan
1 sibling, 2 replies; 28+ messages in thread
From: Chris Murphy @ 2016-08-10 22:23 UTC (permalink / raw)
To: Dave T; +Cc: Chris Murphy, Duncan, Austin Hemmelgarn, Btrfs BTRFS
On Wed, Aug 10, 2016 at 4:01 PM, Dave T <davestechshop@gmail.com> wrote:
> I will be very disappointed if I cannot use btrfs + dm-crypt. As far
> as I can see, there is no alternative given that I need to use
> snapshots (and LVM, as good as it is, has severe performance penalties
> for its snapshots).
See LVM thin provisioning snapshots. I haven't benchmarked it, but
it's a night and day difference from conventional (thick) snapshots.
The gotchas are that currently there's no raid support, and the snapshots
are whole volume. So each snapshot appears as a volume with the same
UUID as the original, and by default they're not active. So for me
it's a bit of a head scratcher what happens when mounting a snapshot
concurrent with another. For Btrfs this ends badly. For XFS it refuses
unless using nouuid, but still seems capable of writing to the two
volumes without causing problems.
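For anyone comparing, setting up a thin pool and taking a thin snapshot
goes roughly like this. The volume group and LV names are invented; the
script skips itself if the example VG doesn't exist.

```shell
#!/bin/sh
set -u
vg=vg0                       # example volume group name

if vgs "$vg" >/dev/null 2>&1; then
    lvcreate --type thin-pool -L 100G -n pool0 "$vg"   # the shared pool
    lvcreate --thin -V 50G -n root "$vg/pool0"         # a thin volume
    # Thin snapshots allocate from the same pool, so there is no
    # conventional copy-on-write penalty; note they are created with
    # the activation-skip flag set (i.e. inactive) by default.
    lvcreate --snapshot -n root-snap "$vg/root"
fi
echo "thin-snapshot sketch for VG $vg"
```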
But yes, I like Btrfs snapshots and reflinks better. *shrug*
If you find a Btrfs on dmcrypt problem, it's a serious bug, and I
think it would get attention very quickly.
--
Chris Murphy
* Re: checksum error in metadata node - best way to move root fs to new drive?
2016-08-10 21:15 ` Chris Murphy
@ 2016-08-10 22:50 ` Dave T
0 siblings, 0 replies; 28+ messages in thread
From: Dave T @ 2016-08-10 22:50 UTC (permalink / raw)
To: Chris Murphy; +Cc: Btrfs BTRFS
see below
On Wed, Aug 10, 2016 at 5:15 PM, Chris Murphy <lists@colorremedies.com> wrote:
> 1. Report 'btrfs check' without --repair, let's see what it complains
> about and if it might be able to plausibly fix this.
First, a small part of the dmesg output:
[ 172.772283] Btrfs loaded
[ 172.772632] BTRFS: device label top_level devid 1 transid 103495 /dev/dm-0
[ 274.320762] BTRFS info (device dm-0): use lzo compression
[ 274.320764] BTRFS info (device dm-0): disk space caching is enabled
[ 274.320764] BTRFS: has skinny extents
[ 274.322555] BTRFS info (device dm-0): bdev /dev/mapper/sysluks
errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
[ 274.329965] BTRFS: detected SSD devices, enabling SSD mode
* Re: checksum error in metadata node - best way to move root fs to new drive?
2016-08-10 22:23 ` Chris Murphy
@ 2016-08-10 22:52 ` Dave T
2016-08-11 14:12 ` Nicholas D Steeves
2016-08-11 7:18 ` Andrei Borzenkov
1 sibling, 1 reply; 28+ messages in thread
From: Dave T @ 2016-08-10 22:52 UTC (permalink / raw)
To: Chris Murphy; +Cc: Duncan, Austin Hemmelgarn, Btrfs BTRFS
Apologies. I have to make a correction to the message I just sent.
Disregard that message and use this one:
On Wed, Aug 10, 2016 at 5:15 PM, Chris Murphy <lists@colorremedies.com> wrote:
> 1. Report 'btrfs check' without --repair, let's see what it complains
> about and if it might be able to plausibly fix this.
First, a small part of the dmesg output:
[ 172.772283] Btrfs loaded
[ 172.772632] BTRFS: device label top_level devid 1 transid 103495 /dev/dm-0
[ 274.320762] BTRFS info (device dm-0): use lzo compression
[ 274.320764] BTRFS info (device dm-0): disk space caching is enabled
[ 274.320764] BTRFS: has skinny extents
[ 274.322555] BTRFS info (device dm-0): bdev /dev/mapper/cryptroot
errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
[ 274.329965] BTRFS: detected SSD devices, enabling SSD mode
Now, full output of btrfs check without repair option.
checking extents
bad metadata [292414541824, 292414558208) crossing stripe boundary
bad metadata [292414607360, 292414623744) crossing stripe boundary
bad metadata [292414672896, 292414689280) crossing stripe boundary
bad metadata [292414738432, 292414754816) crossing stripe boundary
bad metadata [292415787008, 292415803392) crossing stripe boundary
bad metadata [292415918080, 292415934464) crossing stripe boundary
bad metadata [292416376832, 292416393216) crossing stripe boundary
bad metadata [292418015232, 292418031616) crossing stripe boundary
bad metadata [292419325952, 292419342336) crossing stripe boundary
bad metadata [292419588096, 292419604480) crossing stripe boundary
bad metadata [292419915776, 292419932160) crossing stripe boundary
bad metadata [292422930432, 292422946816) crossing stripe boundary
bad metadata [292423061504, 292423077888) crossing stripe boundary
ref mismatch on [292423155712 16384] extent item 1, found 0
Backref 292423155712 root 258 not referenced back 0x2280a20
Incorrect global backref count on 292423155712 found 1 wanted 0
backpointer mismatch on [292423155712 16384]
owner ref check failed [292423155712 16384]
bad metadata [292423192576, 292423208960) crossing stripe boundary
bad metadata [292423323648, 292423340032) crossing stripe boundary
bad metadata [292429549568, 292429565952) crossing stripe boundary
bad metadata [292439904256, 292439920640) crossing stripe boundary
bad metadata [292440297472, 292440313856) crossing stripe boundary
bad metadata [292442525696, 292442542080) crossing stripe boundary
bad metadata [292443770880, 292443787264) crossing stripe boundary
bad metadata [292443967488, 292443983872) crossing stripe boundary
bad metadata [292444033024, 292444049408) crossing stripe boundary
bad metadata [292444098560, 292444114944) crossing stripe boundary
bad metadata [292444164096, 292444180480) crossing stripe boundary
bad metadata [292444229632, 292444246016) crossing stripe boundary
bad metadata [292444688384, 292444704768) crossing stripe boundary
bad metadata [292444884992, 292444901376) crossing stripe boundary
bad metadata [292445081600, 292445097984) crossing stripe boundary
bad metadata [292446720000, 292446736384) crossing stripe boundary
bad metadata [292448948224, 292448964608) crossing stripe boundary
Error: could not find btree root extent for root 258
Checking filesystem on /dev/mapper/cryptroot
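[For reference, the "crossing stripe boundary" complaint above boils down to simple arithmetic: an extent is flagged when its first and last bytes land in different 64 KiB stripes (BTRFS_STRIPE_LEN). A sketch of that arithmetic, not the actual btrfs-progs code:

```python
STRIPE_LEN = 64 * 1024  # BTRFS_STRIPE_LEN in btrfs-progs

def crosses_stripe(start, end):
    """True if the half-open extent [start, end) spans a 64 KiB stripe boundary."""
    return start // STRIPE_LEN != (end - 1) // STRIPE_LEN

# First flagged extent from the check output above: a 16 KiB metadata node
# starting 53248 bytes into its stripe, so its tail spills into the next one.
print(crosses_stripe(292414541824, 292414558208))  # True
```

Note that whether each report indicates a real problem depends on the btrfs-progs version; releases around 4.6 were known to emit false positives for this check on filesystems that were never converted from ext4.]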
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: checksum error in metadata node - best way to move root fs to new drive?
2016-08-10 22:01 ` Dave T
2016-08-10 22:23 ` Chris Murphy
@ 2016-08-11 4:50 ` Duncan
2016-08-11 5:06 ` Gareth Pye
1 sibling, 1 reply; 28+ messages in thread
From: Duncan @ 2016-08-11 4:50 UTC (permalink / raw)
To: linux-btrfs
Dave T posted on Wed, 10 Aug 2016 18:01:44 -0400 as excerpted:
> Does anyone have any thoughts about using dup mode for metadata on a
> Samsung 950 Pro (or any NVMe drive)?
The biggest problem with dup on ssds is that some ssds (particularly the
ones with the sandforce controllers) do dedup, so you'd be having btrfs
do dup while the drive's firmware dedups, to no effect except more cpu and
device processing!
(The other argument for single on ssd that I've seen is that because the
FTL ultimately places the data, and because both copies are written at
the same time, there's a good chance that the FTL will write them into
the same erase block and area, and a defect in one will likely be a
defect in the other as well. That may or may not be, I'm not qualified
to say, but as explained below, I do choose to take my chances on that
and thus do run dup on ssd.)
So as long as the SSD doesn't have a deduping FTL, I'd suggest dup for
metadata on ssd does make sense. Data... not so sure on, but certainly
metadata, because one bad block of metadata can be many messed up files.
On my ssds here, which I know don't do dedup, most of my btrfs are raid1
on the pair of ssds. However, /boot is different since I can't really
point grub at two different /boots, so I have my working /boot on one
device, with the backup /boot on the other, and the grub on each one
pointed at its respective /boot, so I can select working or backup /boot
from the BIOS and it'll just work. Since /boot is so small, it's mixed-
mode chunks, meaning data and metadata are mixed together and the
redundancy mode applies to both at once instead of each separately. And
I chose dup, so it's dup for both data and metadata.
Works fine, dup for both data and metadata on non-deduping ssds, but of
course that means data takes double the space since there's two copies of
it, and that gets kind of expensive on ssd, if it's more than the
fraction of a GiB that's /boot.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: checksum error in metadata node - best way to move root fs to new drive?
2016-08-11 4:50 ` Duncan
@ 2016-08-11 5:06 ` Gareth Pye
2016-08-11 8:20 ` Duncan
0 siblings, 1 reply; 28+ messages in thread
From: Gareth Pye @ 2016-08-11 5:06 UTC (permalink / raw)
To: Duncan; +Cc: linux-btrfs
Is there some simple muddling of metadata that could be done to force
dup metadata on deduping SSDs? Like a simple 'random' byte repeated
often enough that it would defeat any sane dedup? I know it would waste
space, but clearly that is considered worth it with dup metadata (what
is the difference between 50% metadata efficiency and 45%?)
On Thu, Aug 11, 2016 at 2:50 PM, Duncan <1i5t5.duncan@cox.net> wrote:
> Dave T posted on Wed, 10 Aug 2016 18:01:44 -0400 as excerpted:
>
>> Does anyone have any thoughts about using dup mode for metadata on a
>> Samsung 950 Pro (or any NVMe drive)?
>
> The biggest problem with dup on ssds is that some ssds (particularly the
> ones with the sandforce controllers) do dedup, so you'd be having btrfs
> do dup while the drive's firmware dedups, to no effect except more cpu and
> device processing!
>
> (The other argument for single on ssd that I've seen is that because the
> FTL ultimately places the data, and because both copies are written at
> the same time, there's a good chance that the FTL will write them into
> the same erase block and area, and a defect in one will likely be a
> defect in the other as well. That may or may not be, I'm not qualified
> to say, but as explained below, I do choose to take my chances on that
> and thus do run dup on ssd.)
>
> So as long as the SSD doesn't have a deduping FTL, I'd suggest dup for
> metadata on ssd does make sense. Data... not so sure on, but certainly
> metadata, because one bad block of metadata can be many messed up files.
>
> On my ssds here, which I know don't do dedup, most of my btrfs are raid1
> on the pair of ssds. However, /boot is different since I can't really
> point grub at two different /boots, so I have my working /boot on one
> device, with the backup /boot on the other, and the grub on each one
> pointed at its respective /boot, so I can select working or backup /boot
> from the BIOS and it'll just work. Since /boot is so small, it's mixed-
> mode chunks, meaning data and metadata are mixed together and the
> redundancy mode applies to both at once instead of each separately. And
> I chose dup, so it's dup for both data and metadata.
>
> Works fine, dup for both data and metadata on non-deduping ssds, but of
> course that means data takes double the space since there's two copies of
> it, and that gets kind of expensive on ssd, if it's more than the
> fraction of a GiB that's /boot.
>
> --
> Duncan - List replies preferred. No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master." Richard Stallman
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Gareth Pye - blog.cerberos.id.au
Level 2 MTG Judge, Melbourne, Australia
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: checksum error in metadata node - best way to move root fs to new drive?
2016-08-10 22:23 ` Chris Murphy
2016-08-10 22:52 ` Dave T
@ 2016-08-11 7:18 ` Andrei Borzenkov
1 sibling, 0 replies; 28+ messages in thread
From: Andrei Borzenkov @ 2016-08-11 7:18 UTC (permalink / raw)
To: Chris Murphy; +Cc: Dave T, Duncan, Austin Hemmelgarn, Btrfs BTRFS
On Thu, Aug 11, 2016 at 1:23 AM, Chris Murphy <lists@colorremedies.com> wrote:
> On Wed, Aug 10, 2016 at 4:01 PM, Dave T <davestechshop@gmail.com> wrote:
>
>> I will be very disappointed if I cannot use btrfs + dm-crypt. As far
>> as I can see, there is no alternative given that I need to use
>> snapshots (and LVM, as good as it is, has severe performance penalties
>> for its snapshots).
>
> See LVM thin provisioning snapshots. I haven't benchmarked it, but
> it's a night and day difference from conventional (thick) snapshots.
> The gotchas are currently there's no raid support, and the snapshots
> are whole volume. So each snapshot appears as a volume with the same
> UUID as the original, and by default they're not active. So for me
> it's a bit of a head scratcher what happens when mounting a snapshot
> concurrent with another. For Btrfs this ends badly. For XFS it refuses
> unless using nouuid, but still seems capable of writing to the two
> volumes without causing problems.
>
XFS now allows changing UUID, as do LVM and MD. We can also change
btrfs UUID using "btrfstune -u", but I wonder if there is any way to
change device UUID in this case.
One problem is that even before you get around to doing it, various udev
rules kick in and create links to the wrong instance, overwriting the
previous ones; and I'm not sure either xfs_admin or btrfstune triggers a
change event. So we may end up with stale, completely wrong links.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: checksum error in metadata node - best way to move root fs to new drive?
2016-08-11 5:06 ` Gareth Pye
@ 2016-08-11 8:20 ` Duncan
0 siblings, 0 replies; 28+ messages in thread
From: Duncan @ 2016-08-11 8:20 UTC (permalink / raw)
To: linux-btrfs
Gareth Pye posted on Thu, 11 Aug 2016 15:06:48 +1000 as excerpted:
> Is there some simple muddling of metadata that could be done to force
> dup metadata on deduping SSDs? Like a simple 'random' byte repeated
> often enough that it would defeat any sane dedup? I know it would waste
> space, but clearly that is considered worth it with dup metadata (what is
> the difference between 50% metadata efficiency and 45%?)
Well, the FTLs are mostly proprietary, AFAIK, so it's probably hard to
prove the "force", but given the 512-byte sector standard (some are a
multiple of that these days but 512 should be the minimum), in theory one
random byte out of every 512 should do it... unless the compression these
deduping FTLs generally run as well catches that difference and
compresses it out to a different location where it can be compactly
stored, allowing multiple copies of the same 512-byte sector to be stored
in a single sector, so long as they only had a single byte or two
different.
So it could probably be done, but given that the deduping and compression
features of these ssds are listed as just that, features, and that people
buy them for that, it may be that it's better to simply leave well enough
alone. Folks who want dup metadata can set it, and if they haven't
bought one of these ssds with dedup as a feature, they can be reasonably
sure it'll be set. And people who don't care will simply get the
defaults and can live with them the same way that people that don't care
generally live with defaults that may or may not be the absolute best
case for them, but are generally at least not horrible.
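[Gareth's salting idea, purely as a thought experiment - this is not anything btrfs does, just a toy sketch of how one changed byte per sector defeats a hypothetical FTL that dedups on sector hashes:

```python
import os

SECTOR = 512  # assume the FTL dedups on 512-byte sector hashes

def salt(data: bytes) -> bytes:
    """Flip one byte per sector so identical payloads stop hashing alike."""
    out = bytearray(data)
    for off in range(0, len(out), SECTOR):
        # XOR with a random odd value: the result always differs from the original
        out[off] ^= os.urandom(1)[0] | 0x01
    return bytes(out)

payload = b"\x00" * 4096  # maximally dedup-friendly content
salted = salt(payload)
# Every sector now differs from the pristine copy, so no sector-hash match.
assert all(salted[o:o + SECTOR] != payload[o:o + SECTOR]
           for o in range(0, len(payload), SECTOR))
```

Of course, as noted above, a compressing FTL could still squash the salted copies back down, and the salt would have to be stripped on read - exactly the kind of complexity that argues for leaving well enough alone.]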
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: checksum error in metadata node - best way to move root fs to new drive?
2016-08-10 22:52 ` Dave T
@ 2016-08-11 14:12 ` Nicholas D Steeves
2016-08-11 14:45 ` Austin S. Hemmelgarn
` (2 more replies)
0 siblings, 3 replies; 28+ messages in thread
From: Nicholas D Steeves @ 2016-08-11 14:12 UTC (permalink / raw)
To: Dave T; +Cc: Chris Murphy, Duncan, Austin Hemmelgarn, Btrfs BTRFS
Why is the combination of dm-crypt|luks+btrfs+compress=lzo so
overlooked as a potential cause? Other than the "raid56 ate my data"
threads, I've noticed a bunch of "luks+btrfs+compress=lzo ate my data" threads.
On 10 August 2016 at 15:46, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote:
>
> As far as dm-crypt goes, it looks like BTRFS is stable on top in the
> configuration I use (aes-xts-plain64 with a long key using plain dm-crypt
> instead of LUKS). I have heard rumors of issues when using LUKS without
> hardware acceleration, but I've never seen any conclusive proof, and what
> little I've heard sounds more like it was just race conditions elsewhere
> causing the issues.
>
Austin, I'm very curious if they were also using compress=lzo, because
my informal hypothesis is that the encryption+btrfs+compress=lzo
combination precipitates these issues. Maybe the combo is more likely
to trigger these race conditions? It might also be neat to mine the
archive to see whether these seem to be more likely to occur with fast
SSDs vs slow rotational disks. Do you use compress=lzo?
On 10 August 2016 at 18:52, Dave T <davestechshop@gmail.com> wrote:
> On Wed, Aug 10, 2016 at 5:15 PM, Chris Murphy <lists@colorremedies.com> wrote:
>
>> 1. Report 'btrfs check' without --repair, let's see what it complains
>> about and if it might be able to plausibly fix this.
>
> First, a small part of the dmesg output:
>
> [ 172.772283] Btrfs loaded
> [ 172.772632] BTRFS: device label top_level devid 1 transid 103495 /dev/dm-0
> [ 274.320762] BTRFS info (device dm-0): use lzo compression
Compress=lzo confirmed. Corruption occurred on an SSD.
On 10 August 2016 at 17:21, Chris Murphy <lists@colorremedies.com> wrote:
> I'm using LUKS, aes xts-plain64, on six devices. One is using mixed-bg
> single device. One is dsingle mdup. And then 2x2 mraid1 draid1. I've
> had zero problems. The two computers these run on do have aesni
> support. Aging wise, they're all at least a year old. But I've been
> using Btrfs on LUKS for much longer than that.
>
Chris, do you use compress=lzo? SSDs or rotational disks?
If a bunch of people are using this combo without issue, I'll drop the
informal hypothesis as "just a suspicion informed by sloppy pattern
recognition" ;-)
Thank you!
Nicholas
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: checksum error in metadata node - best way to move root fs to new drive?
2016-08-11 14:12 ` Nicholas D Steeves
@ 2016-08-11 14:45 ` Austin S. Hemmelgarn
2016-08-11 19:07 ` Duncan
2016-08-11 20:33 ` Chris Murphy
2 siblings, 0 replies; 28+ messages in thread
From: Austin S. Hemmelgarn @ 2016-08-11 14:45 UTC (permalink / raw)
To: Nicholas D Steeves, Dave T; +Cc: Chris Murphy, Duncan, Btrfs BTRFS
On 2016-08-11 10:12, Nicholas D Steeves wrote:
> Why is the combination of dm-crypt|luks+btrfs+compress=lzo so
> overlooked as a potential cause? Other than the "raid56 ate my data"
> threads, I've noticed a bunch of "luks+btrfs+compress=lzo ate my data" threads.
I haven't personally seen one of those in at least a few months. In
general, BTRFS is moving fast enough that reports older than a kernel
release cycle are generally out of date unless something confirms
otherwise, but I do distinctly recall such issues being commonly
reported in the past.
>
> On 10 August 2016 at 15:46, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote:
>>
>> As far as dm-crypt goes, it looks like BTRFS is stable on top in the
>> configuration I use (aes-xts-plain64 with a long key using plain dm-crypt
>> instead of LUKS). I have heard rumors of issues when using LUKS without
>> hardware acceleration, but I've never seen any conclusive proof, and what
>> little I've heard sounds more like it was just race conditions elsewhere
>> causing the issues.
>>
>
> Austin, I'm very curious if they were also using compress=lzo, because
> my informal hypothesis is that the encryption+btrfs+compress=lzo
> combination precipitates these issues. Maybe the combo is more likely
> to trigger these race conditions? It might also be neat to mine the
> archive to see these seem to be more likely to occur with fast SSDs vs
> slow rotational disks. Do you use compress=lzo?
In my case, I've tested on both SSDs (both cheap low-end ones and good
Intel and Crucial ones) and traditional hard drives, with and without
compression (both zlib and lzo), and with a couple of different
encryption algorithms (AES, Blowfish, and Threefish). In my case it's
only on plain dm-crypt, not LUKS, but I doubt that particular point will
make much difference. The last test I did was when the merge window for
4.6 closed, run as part of the regular regression testing I do, and I'll
be doing another one in the near future. I think the last time I saw
any issues with this in my testing was prior to 4.0, but I don't
remember for sure (most of what I care about is comparison to the
previous version, so I don't keep much in the way of records of specific
things).
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: checksum error in metadata node - best way to move root fs to new drive?
2016-08-11 14:12 ` Nicholas D Steeves
2016-08-11 14:45 ` Austin S. Hemmelgarn
@ 2016-08-11 19:07 ` Duncan
2016-08-11 20:43 ` Chris Murphy
2016-08-11 20:33 ` Chris Murphy
2 siblings, 1 reply; 28+ messages in thread
From: Duncan @ 2016-08-11 19:07 UTC (permalink / raw)
To: linux-btrfs
Nicholas D Steeves posted on Thu, 11 Aug 2016 10:12:04 -0400 as excerpted:
> Why is the combination of dm-crypt|luks+btrfs+compress=lzo so overlooked
> as a potential cause? Other than the "raid56 ate my data" threads, I've
> noticed a bunch of "luks+btrfs+compress=lzo ate my data" threads.
My usage is btrfs on physical device (well, on GPT partitions on the
physical device), no encryption, and it's mostly raid1 on paired devices,
but there's definitely one kink that compress=lzo (and I believe
compression in general, including gzip) adds, and it's possible running
it on encryption compounds the issue.
The compression-related problem is this: Btrfs is considerably less
tolerant of checksum-related errors on btrfs-compressed data, and while
on uncompressed btrfs raid1 it will recover from the second copy where
possible and continue, on files that btrfs has compressed, if there are
enough checksum errors, for example in a hard-shutdown situation where
one of the raid1 devices had the updates written but it crashed while
writing the other, btrfs will crash instead of simply falling back to the
good copy.
This is known to be specific to compression; an uncompressed btrfs
recovers as intended from the second copy. And it's known to occur only when
there's too many checksum errors in a burst -- the filesystem apparently
deals correctly with just a few at a time.
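[The recovery behavior described here as intended - and which scrub does manage - is, in outline, just try-verify-fallback. A minimal sketch, with zlib's crc32 standing in for btrfs's actual crc32c; this is an illustration of the intended read path, not btrfs code:

```python
import zlib

def raid1_read(copies, expected_csum):
    """Return the first copy whose checksum verifies; fall back to the mirror."""
    for data in copies:
        if zlib.crc32(data) == expected_csum:
            return data
    # Only when every mirror fails verification is the read actually lost.
    raise IOError("checksum error on all mirrors")

good = b"extent contents as originally written"
stale = good[:-1] + b"?"  # mirror that missed the last update before the crash
csum = zlib.crc32(good)
assert raid1_read([stale, good], csum) == good  # bad first copy: use the mirror
```

The bug being described is that, with btrfs-compressed extents, a burst of mismatches crashes the system somewhere before the fallback step is reached.]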
This problem has been ongoing for years -- I thought it was just the way
btrfs worked until someone mentioned that it didn't behave that way
without compression -- and it reasonably regularly prevents a smooth
reboot here after a crash.
In my case I have the system btrfs running read-only by default, so it's
not damaged. However, /home and /var/log are of course mounted writable,
and that's where the problems come in. If I start in (I believe) rescue
mode (it's that or emergency; the other won't do the mounts and won't let
me do them manually either, as it thinks a dependency is missing),
systemd will do the mounts but not start the (permanent) logging or the
services that need to routinely write stuff that I have symlinked into
/home/var/whatever so they can write with a read-only root and system
partition. I can then scrub the mounted home and log partitions to fix
the checksum errors due to one device having the update while the other
doesn't, and continue booting normally. However, if I try directly
booting normally, the system invariably crashes due to too many checksum
errors, even when it /should/ simply read the other copy, which is fine
as demonstrated by the fact that scrub can use it to fix the errors on
the device triggering the checksum errors.
This continued to happen with 4.6. I'm on 4.7 now but am not sure I've
crashed with it and thus can't say for sure whether the problem is fixed
there. However, I doubt it, as the problem has been there apparently
since the compression and raid1 features were introduced, and I didn't
see anything mentioning a fix for the issue in the patches going by on
the list.
The problem is most obvious and reproducible in btrfs raid1 mode, since
there, one device /can/ be behind the other, and scrub /can/ be
demonstrated to fix it so it's obviously a checksum issue, but I'd
imagine if enough checksum mismatches happen on a single device in single
mode, it would crash as well, and of course then there's no second copy
for scrub to fix the bad copy from, so it would simply show up as a btrfs
that can mount but with significant corruption issues that will crash the
system if an attempt to read the affected blocks reads too many at a time.
And to whatever possible extent an encryption layer between the physical
device and btrfs results in possible additional corruption in the event
of a crash or hard shutdown, it could easily compound an already bad
situation.
Meanwhile, /if/ that does turn out to be the root issue here, then
finally fixing the btrfs compression related problem where a large burst
of checksum failures crashes the system, even when there provably exists
a second valid copy, but where this only happens with compression, should
go quite far in stabilizing btrfs on encrypted underlayers.
I know I certainly wouldn't object to the problem being fixed. =:^)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: checksum error in metadata node - best way to move root fs to new drive?
2016-08-11 14:12 ` Nicholas D Steeves
2016-08-11 14:45 ` Austin S. Hemmelgarn
2016-08-11 19:07 ` Duncan
@ 2016-08-11 20:33 ` Chris Murphy
2 siblings, 0 replies; 28+ messages in thread
From: Chris Murphy @ 2016-08-11 20:33 UTC (permalink / raw)
To: Nicholas D Steeves
Cc: Dave T, Chris Murphy, Duncan, Austin Hemmelgarn, Btrfs BTRFS
On Thu, Aug 11, 2016 at 8:12 AM, Nicholas D Steeves <nsteeves@gmail.com> wrote:
>
> Chris, do you use compress=lzo? SSDs or rotational disks?
No compression, SSD and HDD. The stuff I care about has been on dmcrypt
(LUKS) for some time. Stuff I sorta care about is on plain partitions.
Stuff I don't care much about is either on LVM LVs (usually thinp),
or qcow2.
I have used compression for periods measured in months not years, both
zlib and lzo, on both SSD and HDD, to no ill effect. But it's true
some of the more abruptly and badly damaged file systems did use
compress=lzo. Since lzo is faster and compresses only a bit worse than
zlib, it may be that more people choose lzo, and that's why, when there
is a problem with compression, it happens to be lzo: coincidence rather
than causation. I'm not even sure there's enough information to
establish correlation.
--
Chris Murphy
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: checksum error in metadata node - best way to move root fs to new drive?
2016-08-11 19:07 ` Duncan
@ 2016-08-11 20:43 ` Chris Murphy
2016-08-12 3:11 ` Duncan
0 siblings, 1 reply; 28+ messages in thread
From: Chris Murphy @ 2016-08-11 20:43 UTC (permalink / raw)
To: Duncan; +Cc: Btrfs BTRFS
On Thu, Aug 11, 2016 at 1:07 PM, Duncan <1i5t5.duncan@cox.net> wrote:
> The compression-related problem is this: Btrfs is considerably less
> tolerant of checksum-related errors on btrfs-compressed data,
Why? The data is the data. And why would it matter if it's application
compressed data vs Btrfs compressed data? If there's an error, Btrfs
is intolerant. I don't see how there's a checksum error that Btrfs
tolerates.
But also I don't know if the checksum is computed on compressed data
or uncompressed data - does the scrub blindly read compressed data,
checksum it, and compare to the previously recorded csum? Or does
the scrub read compressed data, decompress it, checksum it, then
compare? And does compression compress metadata? I don't think it
does, judging from some squashfs testing of the same set of binary
files on ext4 vs btrfs uncompressed vs btrfs compressed. The difference
is explained by inline data being compressed (which it is), so I don't
think the metadata itself gets compressed.
Chris Murphy
--
Chris Murphy
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: checksum error in metadata node - best way to move root fs to new drive?
2016-08-11 20:43 ` Chris Murphy
@ 2016-08-12 3:11 ` Duncan
2016-08-12 3:51 ` Chris Murphy
0 siblings, 1 reply; 28+ messages in thread
From: Duncan @ 2016-08-12 3:11 UTC (permalink / raw)
To: linux-btrfs
Chris Murphy posted on Thu, 11 Aug 2016 14:43:56 -0600 as excerpted:
> On Thu, Aug 11, 2016 at 1:07 PM, Duncan <1i5t5.duncan@cox.net> wrote:
>> The compression-related problem is this: Btrfs is considerably less
>> tolerant of checksum-related errors on btrfs-compressed data,
>
> Why? The data is the data. And why would it matter if it's application
> compressed data vs Btrfs compressed data? If there's an error, Btrfs is
> intolerant. I don't see how there's a checksum error that Btrfs
> tolerates.
Apparently, the code path for compressed data is sufficiently different
that when there's a burst of checksum errors, even on raid1 where it
should (and does with scrub) get the correct second copy, it will crash
the system. This is my experience and that of others, and what I thought
was standard btrfs behavior -- I didn't know it was a compression-
specific bug since I use compress on all my btrfs, until someone told me.
When the btrfs compression option hasn't been used on that filesystem, or
presumably when none of that burst of checksum errors is from btrfs-
compressed files, it will grab the second copy and use it as it should,
and there will be no crash. This is as reported by others, including
people who have tested both with and without btrfs-compressed files and
found that it only crashed if the files were btrfs-compressed, whereas it
worked as expected, fetching the valid second copy, if they weren't btrfs-
compressed.
I'd assume this is why this particular bug has remained unsquashed for so
long. The devs are likely testing compression, and bad checksum data
repair from the second copy, but they probably aren't testing bad
checksum repair on compressed data, so the problem isn't showing up in
their tests. Between that and relatively few people running raid1 with
the compression option and seeing enough bad shutdowns to be aware of the
problem, it has mostly flown under the radar. For a long time I myself
thought it was just the way btrfs behaved with bursts of checksum errors,
until someone pointed out that it did /not/ behave that way on btrfs that
didn't have any compressed files when the checksum errors occurred.
> But also I don't know if the checksum is predicated on compressed data
> or uncompressed data - does the scrub blindly read compressed data,
> checksums it, and compares to the previously recorded csum? Or does the
> scrub read compressed data, decompresses it, checksums it, then
> compares? And does compression compress metadata? I don't think it does
> from some of the squashfs testing of the same set of binary files on
> ext4 vs btrfs uncompressed vs btrfs compressed. The difference is
> explained by inline data being compressed (which it is), so I don't
> think the fs itself gets compressed.
As I'm not a coder I can't actually tell you from reading the code, but
AFAIK, both the 128 KiB compression block size and the checksum are on
the uncompressed data. Compression takes place after checksumming.
And I don't believe metadata, whether metadata itself or inline data, is
compressed by btrfs' transparent compression.
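[If that ordering is right - checksum computed over the uncompressed bytes, compression applied afterward - then a read has to decompress before it can verify. A toy illustration of that ordering using zlib; btrfs's actual checksum is crc32c and its on-disk lzo/zlib framing is its own, so this only demonstrates the sequence of operations:

```python
import zlib

data = b"tuned parameters and notes\n" * 64  # compressible file contents
csum = zlib.crc32(data)          # 1. checksum the uncompressed data
on_disk = zlib.compress(data)    # 2. then compress what actually hits disk
assert len(on_disk) < len(data)  # compression paid off for this data

# The read path reverses the order: decompress first, verify second.
restored = zlib.decompress(on_disk)
assert zlib.crc32(restored) == csum
```
]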
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: checksum error in metadata node - best way to move root fs to new drive?
2016-08-12 3:11 ` Duncan
@ 2016-08-12 3:51 ` Chris Murphy
0 siblings, 0 replies; 28+ messages in thread
From: Chris Murphy @ 2016-08-12 3:51 UTC (permalink / raw)
To: Btrfs BTRFS
On Thu, Aug 11, 2016 at 9:11 PM, Duncan <1i5t5.duncan@cox.net> wrote:
> Chris Murphy posted on Thu, 11 Aug 2016 14:43:56 -0600 as excerpted:
>
>> On Thu, Aug 11, 2016 at 1:07 PM, Duncan <1i5t5.duncan@cox.net> wrote:
>>> The compression-related problem is this: Btrfs is considerably less
>>> tolerant of checksum-related errors on btrfs-compressed data,
>>
>> Why? The data is the data. And why would it matter if it's application
>> compressed data vs Btrfs compressed data? If there's an error, Btrfs is
>> intolerant. I don't see how there's a checksum error that Btrfs
>> tolerates.
>
> Apparently, the code path for compressed data is sufficiently different,
> that when there's a burst of checksum errors, even on raid1 where it
> should (and does with scrub) get the correct second copy, it will crash
> the system.
Ahh OK, gotcha.
> This is my experience and that of others, and what I thought
> was standard btrfs behavior -- I didn't know it was a compression-
> specific bug since I use compress on all my btrfs, until someone told me.
>
> When the btrfs compression option hasn't been used on that filesystem, or
> presumably when none of that burst of checksum errors is from btrfs-
> compressed files, it will grab the second copy and use it as it should,
> and there will be no crash. This is as reported by others, including
> people who have tested both with and without btrfs-compressed files and
> found that it only crashed if the files were btrfs-compressed, whereas it
> worked as expected, fetching the valid second copy, if they weren't btrfs-
> compressed.
OK so something's broken.
>
> As I'm not a coder I can't actually tell you from reading the code, but
> AFAIK, both the 128 KiB compression block size and the checksum are on
> the uncompressed data. Compression takes place after checksumming.
>
> And I don't believe metadata, whether metadata itself or inline data, is
> compressed by btrfs' transparent compression.
Inline data is definitely compressed.
From ls -li
263 -rw-r-----. 1 root root 3270 Aug 11 21:29 samsung840-256g-hdparm.txt
From btrfs-debug-tree
item 84 key (263 INODE_ITEM 0) itemoff 7618 itemsize 160
inode generation 7 transid 7 size 3270 nbytes 3270
block group 0 mode 100640 links 1 uid 0 gid 0
rdev 0 flags 0x0
item 85 key (263 INODE_REF 256) itemoff 7582 itemsize 36
inode ref index 8 namelen 26 name: samsung840-256g-hdparm.txt
item 86 key (263 XATTR_ITEM 3817753667) itemoff 7499 itemsize 83
location key (0 UNKNOWN.0 0) type XATTR
namelen 16 datalen 37 name: security.selinux
data unconfined_u:object_r:unlabeled_t:s0
item 87 key (263 EXTENT_DATA 0) itemoff 5860 itemsize 1639
inline extent data size 1618 ram 3270 compress(zlib)
Curiously though, these same small text files once above a certain
size (?) are not compressed if they aren't inline extents.
278 -rw-r-----. 1 root root 11767 Aug 11 21:29 WDCblack-750g-smartctlx_2.txt
item 48 key (278 INODE_ITEM 0) itemoff 7675 itemsize 160
inode generation 7 transid 7 size 11767 nbytes 12288
block group 0 mode 100640 links 1 uid 0 gid 0
rdev 0 flags 0x0
item 49 key (278 INODE_REF 256) itemoff 7636 itemsize 39
inode ref index 23 namelen 29 name: WDCblack-750g-smartctlx_2.txt
item 50 key (278 XATTR_ITEM 3817753667) itemoff 7553 itemsize 83
location key (0 UNKNOWN.0 0) type XATTR
namelen 16 datalen 37 name: security.selinux
data unconfined_u:object_r:unlabeled_t:s0
item 51 key (278 EXTENT_DATA 0) itemoff 7500 itemsize 53
extent data disk byte 12939264 nr 4096
extent data offset 0 nr 12288 ram 12288
extent compression(zlib)
Hrrmm.
--
Chris Murphy
* Re: checksum error in metadata node - best way to move root fs to new drive?
2016-08-10 21:21 ` Chris Murphy
2016-08-10 22:01 ` Dave T
@ 2016-08-12 17:00 ` Patrik Lundquist
1 sibling, 0 replies; 28+ messages in thread
From: Patrik Lundquist @ 2016-08-12 17:00 UTC (permalink / raw)
To: Btrfs BTRFS
On 10 August 2016 at 23:21, Chris Murphy <lists@colorremedies.com> wrote:
>
> I'm using LUKS, aes xts-plain64, on six devices. One is using mixed-bg
> single device. One is dsingle mdup. And then 2x2 mraid1 draid1. I've
> had zero problems. The two computers these run on do have aesni
> support. Aging wise, they're all at least a year old. But I've been
> using Btrfs on LUKS for much longer than that.
FWIW:
I've had 5 spinning disks with LUKS + Btrfs raid1 for 1.5 years.
Also xts-plain64 with AES-NI acceleration.
No problems so far. Not using Btrfs compression.
* Re: checksum error in metadata node - best way to move root fs to new drive?
2016-08-12 15:06 ` Duncan
@ 2016-08-15 11:33 ` Austin S. Hemmelgarn
0 siblings, 0 replies; 28+ messages in thread
From: Austin S. Hemmelgarn @ 2016-08-15 11:33 UTC (permalink / raw)
To: linux-btrfs
On 2016-08-12 11:06, Duncan wrote:
> Austin S. Hemmelgarn posted on Fri, 12 Aug 2016 08:04:42 -0400 as
> excerpted:
>
>> On a file server? No, I'd ensure proper physical security is
>> established and make sure it's properly secured against network based
>> attacks and then not worry about it. Unless you have things you want to
>> hide from law enforcement or your government (which may or may not be
>> legal where you live) or can reasonably expect someone to steal the
>> system, you almost certainly don't actually need whole disk encryption.
>> There are two specific exceptions to this though:
>> 1. If your employer requires encryption on this system, that's their
>> call.
>> 2. Encrypted swap is a good thing regardless, because it prevents
>> security credentials from accidentally being written unencrypted to
>> persistent storage.
>
> In the US, medical records are pretty well protected under penalty of law
> (HIPAA, IIRC?). Anyone storing medical records here would do well to
> have full filesystem encryption for that reason.
>
> Of course financial records are sensitive as well, or even just forum
> login information, and then there's the various industrial spies from
> various countries (China being the one most frequently named) that would
> pay good money for unencrypted devices from the right sources.
>
Medical and even financial records really fall under my first exception,
but it's still no substitute for proper physical security. As far as
user account information, that depends on what your legal or PR
department promised, but in many cases there, there's minimal
improvement in security when using full disk encryption in place of just
encrypting the database file used to store the information.
In either case though, it's still a better investment in terms of both
time and money to properly secure the network and physical access to the
hardware. All that disk encryption protects is data at rest, and for a
_server_ system, the data is almost always online, and therefore lack of
protection of the system as a whole is usually more of a security issue
in general than lack of protection for a single disk that's powered off.
* Re: checksum error in metadata node - best way to move root fs to new drive?
2016-08-12 12:04 ` Austin S. Hemmelgarn
2016-08-12 15:06 ` Duncan
@ 2016-08-12 17:02 ` Chris Murphy
1 sibling, 0 replies; 28+ messages in thread
From: Chris Murphy @ 2016-08-12 17:02 UTC (permalink / raw)
To: Btrfs BTRFS
On Fri, Aug 12, 2016 at 6:04 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-08-11 16:23, Dave T wrote:
>> 5. Would most of you guys use btrfs + dm-crypt on a production file
>> server (with spinning disks in JBOD configuration -- i.e., no RAID).
>> In this situation, the data is very important, of course. My past
>> experience indicated that RAID only improves uptime, which is not so
>> critical in our environment. Our main criteria is that we should never
>> ever have data loss. As far as I understand it, we do have to use
>> encryption.
>
> On a file server? No, I'd ensure proper physical security is established
> and make sure it's properly secured against network based attacks and then
> not worry about it. Unless you have things you want to hide from law
> enforcement or your government (which may or may not be legal where you
> live) or can reasonably expect someone to steal the system, you almost
> certainly don't actually need whole disk encryption.
Sure but then you need a fairly strict handling policy for those
drives when they leave the environment: e.g. for an RMA if the drive
dies under warranty, or when the drive is being retired. First there's
the actual physical handling (even interception) and accounting of all
of the drives, which has to be rather strict. And second, the fallback
to wiping the drive if it's dead must be physical destruction. For any
data not worth physically destroying the drive for proper disposal,
you can probably forego full disk encryption.
--
Chris Murphy
* Re: checksum error in metadata node - best way to move root fs to new drive?
2016-08-12 12:04 ` Austin S. Hemmelgarn
@ 2016-08-12 15:06 ` Duncan
2016-08-15 11:33 ` Austin S. Hemmelgarn
2016-08-12 17:02 ` Chris Murphy
1 sibling, 1 reply; 28+ messages in thread
From: Duncan @ 2016-08-12 15:06 UTC (permalink / raw)
To: linux-btrfs
Austin S. Hemmelgarn posted on Fri, 12 Aug 2016 08:04:42 -0400 as
excerpted:
> On a file server? No, I'd ensure proper physical security is
> established and make sure it's properly secured against network based
> attacks and then not worry about it. Unless you have things you want to
> hide from law enforcement or your government (which may or may not be
> legal where you live) or can reasonably expect someone to steal the
> system, you almost certainly don't actually need whole disk encryption.
> There are two specific exceptions to this though:
> 1. If your employer requires encryption on this system, that's their
> call.
> 2. Encrypted swap is a good thing regardless, because it prevents
> security credentials from accidentally being written unencrypted to
> persistent storage.
In the US, medical records are pretty well protected under penalty of law
(HIPAA, IIRC?). Anyone storing medical records here would do well to
have full filesystem encryption for that reason.
Of course financial records are sensitive as well, or even just forum
login information, and then there's the various industrial spies from
various countries (China being the one most frequently named) that would
pay good money for unencrypted devices from the right sources.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: checksum error in metadata node - best way to move root fs to new drive?
2016-08-11 20:23 Dave T
2016-08-12 4:13 ` Duncan
2016-08-12 8:14 ` Adam Borowski
@ 2016-08-12 12:04 ` Austin S. Hemmelgarn
2016-08-12 15:06 ` Duncan
2016-08-12 17:02 ` Chris Murphy
2 siblings, 2 replies; 28+ messages in thread
From: Austin S. Hemmelgarn @ 2016-08-12 12:04 UTC (permalink / raw)
To: Dave T, Duncan; +Cc: Nicholas D Steeves, Chris Murphy, Btrfs BTRFS
On 2016-08-11 16:23, Dave T wrote:
> What I have gathered so far is the following:
>
> 1. my RAM is not faulty and I feel comfortable ruling out a memory
> error as having anything to do with the reported problem.
>
> 2. my storage device does not seem to be faulty. I have not figured
> out how to do more definitive testing, but smartctl reports it as
> healthy.
Is this just based on smartctl -H, or is it based on looking at all the
info available from smartctl? Based on everything you've said so far,
it sounds to me like there was a group of uncorrectable errors on the
disk, and the sectors in question have now been remapped by the device's
firmware. Such a situation is actually more common than people think
(this is part of the whole 'reinstall to speed up your system' mentality
in the Windows world). I've actually had this happen before (and
correlated the occurrences with spikes in readings from the data-logging
Geiger counter I have next to my home server). Most disks don't start
to report as failing until they get into pretty bad condition (on most
hard drives, it takes a pretty insanely large count of reallocated
sectors to mark the disk as failed in the drive firmware, and on SSD's
you pretty much have to run it out of spare blocks (which takes a _long_
time on many SSD's)).
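To go beyond "smartctl -H", something like the following sketch pulls out the attributes that actually indicate failing media (device paths are placeholders; attribute names vary slightly between vendors):

```shell
# Extended report: health, vendor attributes, and the error/self-test logs.
smartctl -x /dev/sdX

# For a SATA drive, the attributes that matter most for failing media:
#   5   Reallocated_Sector_Ct   - sectors already remapped by the firmware
#   197 Current_Pending_Sector  - sectors the firmware can't read but hasn't remapped yet
#   198 Offline_Uncorrectable   - sectors that failed during offline testing
smartctl -A /dev/sdX | grep -E 'Reallocated|Pending|Uncorrect'

# A long self-test forces the drive to actually read the whole surface:
smartctl -t long /dev/sdX        # start the test (takes hours on big disks)
smartctl -l selftest /dev/sdX    # check the result once it's done

# NVMe devices report a different log (media errors, spare capacity):
smartctl -a /dev/nvme0
```

These need root and a real device, so treat them as a procedure sketch rather than something to paste blindly.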
>
> 3. this problem first happened on a normally running system in light
> use. It had not recently crashed. But the root fs went read-only for
> an unknown reason.
>
> 4. the aftermath of the initial problem may have been exacerbated by
> hard resetting the system, but that's only a guess
>
>> The compression-related problem is this: Btrfs is considerably less tolerant of checksum-related errors on btrfs-compressed data
>
> I'm an unsophisticated user. The argument in support of this statement
> sounds convincing to me. Therefore, I think I should discontinue using
> compression. Anyone disagree?
>
> Is there anything else I should change? (Do I need to provide
> additional information?)
>
> What can I do to find out more about what caused the initial problem?
> I have heard memory errors mentioned, but that's apparently not the
> case here. I have heard crash recovery mentioned, but that isn't how
> my problem initially happened.
>
> I also have a few general questions:
>
> 1. Can one discontinue using the compress mount option if it has been
> used previously? What happens to existing data if the compress mount
> option is 1) added when it wasn't used before, or 2) dropped when it
> had been used.
Yes, it just affects newly written data. If you want to convert
existing data to be uncompressed, you'll need to run 'btrfs filesystem
defrag -r <path>' on the filesystem to convert things.
>
> 2. I understand that the compress option generally improves btrfs
> performance (via Phoronix article I read in the past; I don't find the
> link). Since encryption has some characteristics in common with
> compression, would one expect any decrease in performance from
> dropping compression when using btrfs on dm-crypt? (For more context,
> with an i7 6700K which has aes-ni, CPU performance should not be a
> bottleneck on my computer.)
I would expect a change in performance in that case, but not necessarily
a decrease. The biggest advantage of compression is that it trades time
spent using the disk for time spent using the CPU. In many cases, this
is a favorable trade-off when your storage is slower than your memory
(because memory speed is really the big limiting factor here, not
processor speed). In your case, the encryption is hardware accelerated,
but the compression isn't, so you should in theory actually get better
performance by turning off compression.
>
> 3. How do I find out if it is appropriate to use dup metadata on a
> Samsung 950 Pro NVMe drive? I don't see deduplication mentioned in the
> drive's datasheet:
> http://www.samsung.com/semiconductor/minisite/ssd/downloads/document/Samsung_SSD_950_PRO_Data_Sheet_Rev_1_2.pdf
Whether or not it does deduplication is hard to answer. If it does,
then you obviously should avoid dup metadata. If it doesn't, then it's
a complex question as to whether or not to use dup metadata. The short
explanation for why is that the SSD firmware maintains a somewhat
arbitrary mapping between LBA's and actual location of the data in
flash, and it tends to group writes from around the same time together
in the flash itself. The argument against dup on SSD's in general takes
this into account, arguing that because the data is likely to be in the
same erase block for both copies, it's not as well protected.
Personally, I run dup on non-deduplicating SSD's anyway, because I
don't trust higher layers to not potentially mess up one of the copies,
and I still get better performance than most hard disks.
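For reference, if you do decide on dup metadata, it can be set either at mkfs time or converted in place with a balance (a sketch; /dev/mapper/crypt and /mnt are placeholders):

```shell
# At creation time: duplicated metadata, single data, on one device.
mkfs.btrfs -m dup -d single /dev/mapper/crypt

# Or convert the metadata profile of an existing filesystem in place:
mount /dev/mapper/crypt /mnt
btrfs balance start -mconvert=dup /mnt

# Verify which profiles are actually in use:
btrfs filesystem df /mnt
```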
>
> 4. Given that my drive is not reporting problems, does it seem
> reasonable to re-use this drive after the errors I reported? If so,
> how should I do that? Can I simply make a new btrfs filesystem and
> copy my data back? Should I start at a lower level and re-do the
> dm-crypt layer?
If it were me, I'd rebuild from the ground up just to be sure that
everything is in a known working state. That way you can be reasonably
sure any issues are not left over from the previous configuration.
>
> 5. Would most of you guys use btrfs + dm-crypt on a production file
> server (with spinning disks in JBOD configuration -- i.e., no RAID).
> In this situation, the data is very important, of course. My past
> experience indicated that RAID only improves uptime, which is not so
> critical in our environment. Our main criteria is that we should never
> ever have data loss. As far as I understand it, we do have to use
> encryption.
On a file server? No, I'd ensure proper physical security is
established and make sure it's properly secured against network based
attacks and then not worry about it. Unless you have things you want to
hide from law enforcement or your government (which may or may not be
legal where you live) or can reasonably expect someone to steal the
system, you almost certainly don't actually need whole disk encryption.
There are two specific exceptions to this though:
1. If your employer requires encryption on this system, that's their call.
2. Encrypted swap is a good thing regardless, because it prevents
security credentials from accidentally being written unencrypted to
persistent storage.
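A common way to get encrypted swap with a throwaway key is via crypttab; this is a config sketch only (/dev/sdXN is a placeholder, and option names can differ between distributions):

```shell
# /etc/crypttab: re-key the swap partition with random data on every boot,
# so nothing written to swap survives a reboot:
#
#   swap  /dev/sdXN  /dev/urandom  swap,cipher=aes-xts-plain64,size=512
#
# /etc/fstab: reference the mapped device, never the raw partition:
#
#   /dev/mapper/swap  none  swap  sw  0  0
```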
On my personal systems, I only use encryption for swap space and
security credentials, but I use file based encryption for the
credentials. I also don't store any data that needs absolute protection
against people stealing it though (other than the security credentials,
but I can remotely deauthorize any of those with minimal effort), so
there's not much advantage for me as a user to using disk encryption.
Things are pretty similar at work, except the reasoning there is that we
have good network protection, and restricted access to the server room,
so there's no way realistically without causing significant amounts of
damage elsewhere that the data could be stolen (although we're in a
small enough industry that the only people likely to want to steal our
data is our competitors, and they don't have the funding to pull off
industrial espionage).
Now, as far as RAID, I don't entirely agree about it just improving
up-time. That's one of the big advantages, but it's not the only one.
Having a system that will survive a disk failure and keep working is
good for other reasons too:
1. It makes it less immediately critical that things be dealt with (for
example, if a disk fails in the middle of the night, you can often wait
until the next morning to deal with it).
2. When done right with a system that supports hot-swap properly (all
server systems these days should), it allows for much simpler and much
safer storage device upgrades.
3. It makes it easier (when done with BTRFS or LVM) to re-provision
storage space without having to take the system off-line.
I could have almost any of the Linux servers at work back up and running
correctly from a backup in about 15 minutes, but I still have them set
up with RAID-1 because it lets me do things like install bigger storage
devices with minimal chance of data loss. As for my personal systems,
my home server is set up with RAID in such a way that I can lose 3 of
the 4 hard drives and 1 of the 2 SSD's and still not need to restore
from backup (and still have a working system).
* Re: checksum error in metadata node - best way to move root fs to new drive?
2016-08-11 20:23 Dave T
2016-08-12 4:13 ` Duncan
@ 2016-08-12 8:14 ` Adam Borowski
2016-08-12 12:04 ` Austin S. Hemmelgarn
2 siblings, 0 replies; 28+ messages in thread
From: Adam Borowski @ 2016-08-12 8:14 UTC (permalink / raw)
To: Dave T; +Cc: Btrfs BTRFS
On Thu, Aug 11, 2016 at 04:23:45PM -0400, Dave T wrote:
> 1. Can one discontinue using the compress mount option if it has been
> used previously?
The mount option applies only to newly written blocks, and even then only to
files that don't say otherwise (via chattr +c or +C, btrfs property, etc).
You can change it on the fly (mount -o remount,...), etc.
> What happens to existing data if the compress mount option is 1) added
> when it wasn't used before, or 2) dropped when it had been used.
That data stays compressed or uncompressed, as when it was written. You can
defrag them to change that; balance moves extents without changing their
compression.
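Putting that together, dropping compression for both new and existing data looks roughly like this (paths are placeholders; exact defrag flags vary by btrfs-progs version):

```shell
# 1. Remove "compress" from the mount options (e.g. in fstab) and remount,
#    so newly written extents are uncompressed:
mount -o remount /mnt

# 2. Rewrite existing files so they follow the new setting:
btrfs filesystem defrag -r /mnt

# The other direction -- force-compress existing files with zlib:
btrfs filesystem defrag -r -czlib /mnt
```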
> 2. I understand that the compress option generally improves btrfs
> performance (via Phoronix article I read in the past; I don't find the
> link). Since encryption has some characteristics in common with
> compression, would one expect any decrease in performance from
> dropping compression when using btrfs on dm-crypt? (For more context,
> with an i7 6700K which has aes-ni, CPU performance should not be a
> bottleneck on my computer.)
As said elsewhere, compression can drastically help or reduce performance,
this depends on your CPU-to-IO ratio, and to whether you do small random
writes inside files (compress has to rewrite a whole 128KB block).
An extreme data point: Odroid-U2 on eMMC doing Debian archive rebuilds,
compression improves overall throughput by a factor of around two! On the
other hand, this same task on typical machines tends to be CPU bound.
--
An imaginary friend squared is a real enemy.
* Re: checksum error in metadata node - best way to move root fs to new drive?
2016-08-11 20:23 Dave T
@ 2016-08-12 4:13 ` Duncan
2016-08-12 8:14 ` Adam Borowski
2016-08-12 12:04 ` Austin S. Hemmelgarn
2 siblings, 0 replies; 28+ messages in thread
From: Duncan @ 2016-08-12 4:13 UTC (permalink / raw)
To: linux-btrfs
Dave T posted on Thu, 11 Aug 2016 16:23:45 -0400 as excerpted:
> I also have a few general questions:
>
> 1. Can one discontinue using the compress mount option if it has been
> used previously? What happens to existing data if the compress mount
> option is 1) added when it wasn't used before, or 2) dropped when it had
> been used.
The compress mount option only affects newly written data. Data that was
previously written is automatically decompressed into memory on read,
regardless of whether the compress option is still being used or not.
So you can freely switch between using the option and not, and it'll only
affect newly written files. Existing files stay written the way they
are, unless you do something (like run a recursive defrag with the
compress option) to rewrite them.
> 2. I understand that the compress option generally improves btrfs
> performance (via Phoronix article I read in the past; I don't find the
> link). Since encryption has some characteristics in common with
> compression, would one expect any decrease in performance from dropping
> compression when using btrfs on dm-crypt? (For more context,
> with an i7 6700K which has aes-ni, CPU performance should not be a
> bottleneck on my computer.)
Compression performance works like this (this is a general rule, not
btrfs specific): Compression uses more CPU cycles but results in less
data to actually transfer to and from storage. If your disks are slow
and your CPU is fast (or if the CPU can use hardware accelerated
compression functions), performance will tend to favor compression,
because the bottleneck will be the actual data transfer to and from
storage and the extra overhead of the CPU cycles won't normally matter
while the effect of less data to actually transfer, due to the
compression, will.
But the slower the CPU (and lack of hardware accelerated compression
functions) is and the faster storage IO is, the less of a bottleneck the
actual data transfer will be, and thus the more likely it will be that
the CPU will become the bottleneck, particularly as the compression gets
more efficient size-wise, which generally translates to requiring more
CPU cycles and/or memory to handle it.
Since your storage is PCIE-3.0 @ > 1 GiB/sec, extremely fast, even tho LZO
compression is considered fast (as opposed to size-efficient) as well,
you may actually see /better/ performance without compression, especially
when running CPU-heavy workloads where the extra CPU cycles of
compression will matter as the CPU is already the bottleneck.
Since you're doing encryption also, and that too tends to be CPU
intensive (even if it's hardware accelerated for you), I'd actually be a
bit surprised if you didn't see an increase of performance without
compression, because your storage /is/ so incredibly fast compared to
conventional storage.
But of course if it's really a concern, there's nothing like actually
benchmarking it yourself to see. =:^) But I'd be very surprised if you
actually notice a slowdown, turning compression off. You might not
notice a performance boost either, but I'd be surprised if you notice a
slowdown, tho some artificial benchmarks might show one if they aren't
balancing CPU and IO in something like real-world.
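A crude way to benchmark this yourself is to time the same bulk copy with and without the compress option (a sketch; device and paths are placeholders, and note that /dev/zero compresses perfectly while /dev/urandom doesn't compress at all, so a realistic data set such as a copy of /usr is a better input than either):

```shell
# With LZO compression:
mount -o compress=lzo /dev/mapper/crypt /mnt
time sh -c 'cp -a /usr /mnt/testdata; sync'
umount /mnt

# Without compression:
mount /dev/mapper/crypt /mnt
time sh -c 'cp -a /usr /mnt/testdata2; sync'
umount /mnt
```

Single-threaded bulk copies are only a proxy; workloads with small random writes inside files may behave very differently, as noted elsewhere in the thread.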
> 3. How do I find out if it is appropriate to use dup metadata on a
> Samsung 950 Pro NVMe drive? I don't see deduplication mentioned in the
> drive's datasheet:
> http://www.samsung.com/semiconductor/minisite/ssd/downloads/document/Samsung_SSD_950_PRO_Data_Sheet_Rev_1_2.pdf
I'd google the controller. A lot of them will list either compression
and dedup as features as they enhance performance in some cases, or the
stability of constant performance as a feature, as mine, targeted at the
server market, did. If the emphasis is on constant performance and what-
you-see-is-what-you-get storage capacity, then they're not doing
compression and dedup, as that can increase performance and storage
capacity under certain conditions, but it's very unpredictable as it
depends on how much duplication the data has and how compressible it is.
Sandforce controllers, in particular, are known to emphasize compression
and dedup. OTOH, controllers targeted at enterprise or servers are
likely to emphasize stability and predictability and thus not do
transparent compression or dedup.
> 4. Given that my drive is not reporting problems, does it seem
> reasonable to re-use this drive after the errors I reported? If so,
> how should I do that? Can I simply make a new btrfs filesystem and copy
> my data back? Should I start at a lower level and re-do the dm-crypt
> layer?
I'd reuse it here. For hardware that supports/needs trim I'd start at
the bottom layer and work up, but IIRC you said yours doesn't need it,
and by the time you get to the btrfs layer on top of the crypt layer, the
hardware layer should be scrambled zeros and ones in any case, so if it's
true your hardware doesn't need it, I'd guess you should be fine just
doing the mkfs on top of the existing dmcrypted layer.
But I don't use a crypted layer here, so better to rely on others with
experience with it, if you have their answers to rely on.
> 5. Would most of you guys use btrfs + dm-crypt on a production file
> server (with spinning disks in JBOD configuration -- i.e., no RAID).
> In this situation, the data is very important, of course. My past
> experience indicated that RAID only improves uptime, which is not so
> critical in our environment. Our main criteria is that we should never
> ever have data loss. As far as I understand it, we do have to use
> encryption.
I'd suggest, if the data is that important, do btrfs raid1. Because
unlike most raid, btrfs raid takes advantage of btrfs checksumming, and
actually gives you a second copy to fall back on as well as to repair a
bad copy, if the first copy tried fails the checksum test. That level of
run-time-verified data integrity and repair is something most raid
systems don't have -- they'll only use the parity or redundancy to verify
integrity if a device fails or if a scrub is done (and even with a scrub,
in most cases at least for redundant-raid they simply blindly copy the
one device to the others, no real integrity checking at all). But
because btrfs raid1 actually does that real-time integrity checking and
repair, it's a lot stronger in use-cases where data integrity is
paramount.
Tho do note that btrfs raid1 is ONLY two-copy, additional devices
increase capacity, not redundancy. So I'd create two crypted devices of
roughly the same size out of your JBOD, and expose them to btrfs to use
as a raid1.
Or if you want a cold-spare, create three crypted devices of about the
same size, create a btrfs raid1 out of two of them, and keep the third in
reserve to btrfs replace, if needed.
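That suggestion sketches out to something like the following (device names are placeholders, and the replace command at the end is illustrative only):

```shell
# Two similarly sized crypted devices:
cryptsetup luksFormat /dev/sdb
cryptsetup luksFormat /dev/sdc
cryptsetup open /dev/sdb crypt1
cryptsetup open /dev/sdc crypt2

# Two-copy raid1 for both data and metadata across the mapped devices:
mkfs.btrfs -d raid1 -m raid1 /dev/mapper/crypt1 /dev/mapper/crypt2
mount /dev/mapper/crypt1 /mnt    # either member mounts the whole filesystem

# If a device later fails, with the cold spare opened as crypt3:
#   btrfs replace start <failed-devid> /dev/mapper/crypt3 /mnt
```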
Tho as I said earlier, I don't personally trust btrfs on the crypted
layer yet, so for me, I'd either use something other than btrfs, or use
btrfs but really emphasize the backups, including testing them of course,
because I /don't/ really trust btrfs on crypted just yet. But based on
earlier posts in this thread, I admit it's very possible that all the
reported cases that are the basis for my not trusting btrfs on dmcrypt
yet, were using btrfs compression, and it's possible /that/ was the real
problem, and without it, things will be fine.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: checksum error in metadata node - best way to move root fs to new drive?
@ 2016-08-11 20:23 Dave T
2016-08-12 4:13 ` Duncan
` (2 more replies)
0 siblings, 3 replies; 28+ messages in thread
From: Dave T @ 2016-08-11 20:23 UTC (permalink / raw)
To: Duncan
Cc: Nicholas D Steeves, Chris Murphy, Btrfs BTRFS, Austin S. Hemmelgarn
What I have gathered so far is the following:
1. my RAM is not faulty and I feel comfortable ruling out a memory
error as having anything to do with the reported problem.
2. my storage device does not seem to be faulty. I have not figured
out how to do more definitive testing, but smartctl reports it as
healthy.
3. this problem first happened on a normally running system in light
use. It had not recently crashed. But the root fs went read-only for
an unknown reason.
4. the aftermath of the initial problem may have been exacerbated by
hard resetting the system, but that's only a guess
> The compression-related problem is this: Btrfs is considerably less tolerant of checksum-related errors on btrfs-compressed data
I'm an unsophisticated user. The argument in support of this statement
sounds convincing to me. Therefore, I think I should discontinue using
compression. Anyone disagree?
Is there anything else I should change? (Do I need to provide
additional information?)
What can I do to find out more about what caused the initial problem?
I have heard memory errors mentioned, but that's apparently not the
case here. I have heard crash recovery mentioned, but that isn't how
my problem initially happened.
I also have a few general questions:
1. Can one discontinue using the compress mount option if it has been
used previously? What happens to existing data if the compress mount
option is 1) added when it wasn't used before, or 2) dropped when it
had been used.
2. I understand that the compress option generally improves btrfs
performance (via Phoronix article I read in the past; I don't find the
link). Since encryption has some characteristics in common with
compression, would one expect any decrease in performance from
dropping compression when using btrfs on dm-crypt? (For more context,
with an i7 6700K which has aes-ni, CPU performance should not be a
bottleneck on my computer.)
3. How do I find out if it is appropriate to use dup metadata on a
Samsung 950 Pro NVMe drive? I don't see deduplication mentioned in the
drive's datasheet:
http://www.samsung.com/semiconductor/minisite/ssd/downloads/document/Samsung_SSD_950_PRO_Data_Sheet_Rev_1_2.pdf
4. Given that my drive is not reporting problems, does it seem
reasonable to re-use this drive after the errors I reported? If so,
how should I do that? Can I simply make a new btrfs filesystem and
copy my data back? Should I start at a lower level and re-do the
dm-crypt layer?
5. Would most of you guys use btrfs + dm-crypt on a production file
server (with spinning disks in JBOD configuration -- i.e., no RAID).
In this situation, the data is very important, of course. My past
experience indicated that RAID only improves uptime, which is not so
critical in our environment. Our main criteria is that we should never
ever have data loss. As far as I understand it, we do have to use
encryption.
Thanks for the discussion so far. It's very educational for me.
end of thread, other threads:[~2016-08-15 11:33 UTC | newest]
Thread overview: 28+ messages
2016-08-10 3:27 checksum error in metadata node - best way to move root fs to new drive? Dave T
2016-08-10 6:27 ` Duncan
2016-08-10 19:46 ` Austin S. Hemmelgarn
2016-08-10 21:21 ` Chris Murphy
2016-08-10 22:01 ` Dave T
2016-08-10 22:23 ` Chris Murphy
2016-08-10 22:52 ` Dave T
2016-08-11 14:12 ` Nicholas D Steeves
2016-08-11 14:45 ` Austin S. Hemmelgarn
2016-08-11 19:07 ` Duncan
2016-08-11 20:43 ` Chris Murphy
2016-08-12 3:11 ` Duncan
2016-08-12 3:51 ` Chris Murphy
2016-08-11 20:33 ` Chris Murphy
2016-08-11 7:18 ` Andrei Borzenkov
2016-08-11 4:50 ` Duncan
2016-08-11 5:06 ` Gareth Pye
2016-08-11 8:20 ` Duncan
2016-08-12 17:00 ` Patrik Lundquist
2016-08-10 21:15 ` Chris Murphy
2016-08-10 22:50 ` Dave T
2016-08-11 20:23 Dave T
2016-08-12 4:13 ` Duncan
2016-08-12 8:14 ` Adam Borowski
2016-08-12 12:04 ` Austin S. Hemmelgarn
2016-08-12 15:06 ` Duncan
2016-08-15 11:33 ` Austin S. Hemmelgarn
2016-08-12 17:02 ` Chris Murphy