* Re: checksum error in metadata node - best way to move root fs to new drive?
@ 2016-08-11 20:23 Dave T
2016-08-12 4:13 ` Duncan
` (2 more replies)
0 siblings, 3 replies; 28+ messages in thread
From: Dave T @ 2016-08-11 20:23 UTC (permalink / raw)
To: Duncan
Cc: Nicholas D Steeves, Chris Murphy, Btrfs BTRFS, Austin S. Hemmelgarn
What I have gathered so far is the following:
1. my RAM is not faulty and I feel comfortable ruling out a memory
error as having anything to do with the reported problem.
2. my storage device does not seem to be faulty. I have not figured
out how to do more definitive testing, but smartctl reports it as
healthy.
3. this problem first happened on a normally running system in light
use. It had not recently crashed. But the root fs went read-only for
an unknown reason.
4. the aftermath of the initial problem may have been exacerbated by
hard resetting the system, but that's only a guess
> The compression-related problem is this: Btrfs is considerably less tolerant of checksum-related errors on btrfs-compressed data
I'm an unsophisticated user. The argument in support of this statement
sounds convincing to me. Therefore, I think I should discontinue using
compression. Anyone disagree?
Is there anything else I should change? (Do I need to provide
additional information?)
What can I do to find out more about what caused the initial problem?
I have heard memory errors mentioned, but that's apparently not the
case here. I have heard crash recovery mentioned, but that isn't how
my problem initially happened.
I also have a few general questions:
1. Can one discontinue using the compress mount option if it has been
used previously? What happens to existing data if the compress mount
option is 1) added when it wasn't used before, or 2) dropped when it
had been used?
2. I understand that the compress option generally improves btrfs
performance (via Phoronix article I read in the past; I don't find the
link). Since encryption has some characteristics in common with
compression, would one expect any decrease in performance from
dropping compression when using btrfs on dm-crypt? (For more context,
with an i7 6700K which has aes-ni, CPU performance should not be a
bottleneck on my computer.)
3. How do I find out if it is appropriate to use dup metadata on a
Samsung 950 Pro NVMe drive? I don't see deduplication mentioned in the
drive's datasheet:
http://www.samsung.com/semiconductor/minisite/ssd/downloads/document/Samsung_SSD_950_PRO_Data_Sheet_Rev_1_2.pdf
4. Given that my drive is not reporting problems, does it seem
reasonable to re-use this drive after the errors I reported? If so,
how should I do that? Can I simply make a new btrfs filesystem and
copy my data back? Should I start at a lower level and re-do the
dm-crypt layer?
5. Would most of you guys use btrfs + dm-crypt on a production file
server (with spinning disks in JBOD configuration -- i.e., no RAID)?
In this situation, the data is very important, of course. My past
experience indicated that RAID only improves uptime, which is not so
critical in our environment. Our main criterion is that we should never
ever have data loss. As far as I understand it, we do have to use
encryption.
Thanks for the discussion so far. It's very educational for me.
* Re: checksum error in metadata node - best way to move root fs to new drive?
From: Duncan @ 2016-08-12  4:13 UTC (permalink / raw)
To: linux-btrfs

Dave T posted on Thu, 11 Aug 2016 16:23:45 -0400 as excerpted:

> I also have a few general questions:
>
> 1. Can one discontinue using the compress mount option if it has been
> used previously? What happens to existing data if the compress mount
> option is 1) added when it wasn't used before, or 2) dropped when it
> had been used?

The compress mount option only affects newly written data.  Data that
was previously written is automatically decompressed into memory on
read, regardless of whether the compress option is still being used or
not.

So you can freely switch between using the option and not, and it'll
only affect newly written files.  Existing files stay written the way
they are, unless you do something (like run a recursive defrag with the
compress option) to rewrite them.

> 2. I understand that the compress option generally improves btrfs
> performance (via Phoronix article I read in the past; I don't find the
> link). Since encryption has some characteristics in common with
> compression, would one expect any decrease in performance from
> dropping compression when using btrfs on dm-crypt? (For more context,
> with an i7 6700K which has aes-ni, CPU performance should not be a
> bottleneck on my computer.)

Compression performance works like this (this is a general rule, not
btrfs specific): compression uses more CPU cycles but results in less
data to actually transfer to and from storage.

If your disks are slow and your CPU is fast (or if the CPU can use
hardware accelerated compression functions), performance will tend to
favor compression, because the bottleneck will be the actual data
transfer to and from storage; the extra overhead of the CPU cycles
won't normally matter, while the effect of less data to actually
transfer, due to the compression, will.

But the slower the CPU (especially one lacking hardware accelerated
compression functions) and the faster the storage IO, the less of a
bottleneck the actual data transfer will be, and thus the more likely
it will be that the CPU becomes the bottleneck, particularly as the
compression gets more efficient size-wise, which generally translates
to requiring more CPU cycles and/or memory to handle it.

Since your storage is PCIE-3.0 @ > 1 GiB/sec, extremely fast, even tho
LZO compression is considered fast (as opposed to size-efficient) as
well, you may actually see /better/ performance without compression,
especially when running CPU-heavy workloads where the extra CPU cycles
of compression will matter as the CPU is already the bottleneck.

Since you're doing encryption also, and that too tends to be CPU
intensive (even if it's hardware accelerated for you), I'd actually be
a bit surprised if you didn't see an increase of performance without
compression, because your storage /is/ so incredibly fast compared to
conventional storage.

But of course if it's really a concern, there's nothing like actually
benchmarking it yourself to see. =:^)  I'd be very surprised if you
actually notice a slowdown turning compression off.  You might not
notice a performance boost either, tho some artificial benchmarks might
show a slowdown if they aren't balancing CPU and IO in something like
real-world proportions.

> 3. How do I find out if it is appropriate to use dup metadata on a
> Samsung 950 Pro NVMe drive? I don't see deduplication mentioned in the
> drive's datasheet:
> http://www.samsung.com/semiconductor/minisite/ssd/downloads/document/Samsung_SSD_950_PRO_Data_Sheet_Rev_1_2.pdf

I'd google the controller.  A lot of them will list either compression
and dedup as features, as they enhance performance in some cases, or
the stability of constant performance as a feature, as mine, targeted
at the server market, did.  If the emphasis is on constant performance
and what-you-see-is-what-you-get storage capacity, then they're not
doing compression and dedup, as those can increase performance and
storage capacity under certain conditions, but the gain is very
unpredictable, since it depends on how much duplication the data has
and how compressible it is.

Sandforce controllers, in particular, are known to emphasize
compression and dedup.  OTOH, controllers targeted at enterprise or
servers are likely to emphasize stability and predictability and thus
not do transparent compression or dedup.

> 4. Given that my drive is not reporting problems, does it seem
> reasonable to re-use this drive after the errors I reported? If so,
> how should I do that? Can I simply make a new btrfs filesystem and
> copy my data back? Should I start at a lower level and re-do the
> dm-crypt layer?

I'd reuse it here.  For hardware that supports/needs trim I'd start at
the bottom layer and work up, but IIRC you said yours doesn't need it,
and by the time you get to the btrfs layer on top of the crypt layer,
the hardware layer should be scrambled zeros and ones in any case.  So
if it's true your hardware doesn't need trim, I'd guess you should be
fine just doing the mkfs on top of the existing dmcrypted layer.

But I don't use a crypted layer here, so better to rely on others with
experience with it, if you have their answers to rely on.

> 5. Would most of you guys use btrfs + dm-crypt on a production file
> server (with spinning disks in JBOD configuration -- i.e., no RAID)?
> In this situation, the data is very important, of course. My past
> experience indicated that RAID only improves uptime, which is not so
> critical in our environment. Our main criterion is that we should
> never ever have data loss. As far as I understand it, we do have to
> use encryption.

I'd suggest, if the data is that important, do btrfs raid1.  Because
unlike most raid, btrfs raid takes advantage of btrfs checksumming, and
actually gives you a second copy to fall back on as well as to repair a
bad copy, if the first copy tried fails the checksum test.

That level of run-time-verified data integrity and repair is something
most raid systems don't have -- they'll only use the parity or
redundancy to verify integrity if a device fails or if a scrub is done
(and even with a scrub, in most cases at least for redundant-raid they
simply blindly copy the one device to the others, no real integrity
checking at all).  But because btrfs raid1 actually does that real-time
integrity checking and repair, it's a lot stronger in use-cases where
data integrity is paramount.

Tho do note that btrfs raid1 is ONLY two-copy; additional devices
increase capacity, not redundancy.

So I'd create two crypted devices of roughly the same size out of your
JBOD, and expose them to btrfs to use as a raid1.  Or if you want a
cold spare, create three crypted devices of about the same size, create
a btrfs raid1 out of two of them, and keep the third in reserve to
btrfs replace, if needed.

Tho as I said earlier, I don't personally trust btrfs on the crypted
layer yet, so for me, I'd either use something other than btrfs, or use
btrfs but really emphasize the backups, including testing them of
course, because I /don't/ really trust btrfs on crypted just yet.

But based on earlier posts in this thread, I admit it's very possible
that all the reported cases that are the basis for my not trusting
btrfs on dmcrypt yet were using btrfs compression, and it's possible
/that/ was the real problem, and without it, things will be fine.

-- 
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
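Duncan's two-crypted-devices-plus-raid1 suggestion might look roughly like this in practice. This is a hedged sketch, not from the thread itself: the device names (/dev/sdb, /dev/sdc), mapper names (crypt0, crypt1, crypt2), and mount point are placeholders to adjust for the actual hardware.

```shell
# Hypothetical devices; real setups also need passphrase/keyfile handling.
cryptsetup luksFormat /dev/sdb
cryptsetup luksFormat /dev/sdc
cryptsetup open /dev/sdb crypt0
cryptsetup open /dev/sdc crypt1

# btrfs raid1 keeps exactly two copies of data and metadata,
# spread across the member devices.
mkfs.btrfs -d raid1 -m raid1 /dev/mapper/crypt0 /dev/mapper/crypt1
mount /dev/mapper/crypt0 /mnt

# With a third crypted device held as a cold spare (crypt2), a failed
# member could later be swapped out with:
#   btrfs replace start /dev/mapper/crypt1 /dev/mapper/crypt2 /mnt
```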
* Re: checksum error in metadata node - best way to move root fs to new drive?
From: Adam Borowski @ 2016-08-12  8:14 UTC (permalink / raw)
To: Dave T; +Cc: Btrfs BTRFS

On Thu, Aug 11, 2016 at 04:23:45PM -0400, Dave T wrote:
> 1. Can one discontinue using the compress mount option if it has been
> used previously?

The mount option applies only to newly written blocks, and even then
only to files that don't say otherwise (via chattr +c or +C, btrfs
property, etc).  You can change it on the fly (mount -o remount,...).

> What happens to existing data if the compress mount option is 1) added
> when it wasn't used before, or 2) dropped when it had been used?

That data stays compressed or uncompressed, as when it was written.
You can defrag it to change that; balance moves extents without
changing their compression.

> 2. I understand that the compress option generally improves btrfs
> performance (via Phoronix article I read in the past; I don't find the
> link). Since encryption has some characteristics in common with
> compression, would one expect any decrease in performance from
> dropping compression when using btrfs on dm-crypt? (For more context,
> with an i7 6700K which has aes-ni, CPU performance should not be a
> bottleneck on my computer.)

As said elsewhere, compression can drastically help or hurt
performance; this depends on your CPU-to-IO ratio, and on whether you
do small random writes inside files (compress has to rewrite a whole
128KB block).

An extreme data point: on an Odroid-U2 on eMMC doing Debian archive
rebuilds, compression improves overall throughput by a factor of around
two!  On the other hand, the same task on typical machines tends to be
CPU bound.

-- 
An imaginary friend squared is a real enemy.
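The compression-toggling behavior described above can be exercised with commands along these lines. This is a sketch; the mount point and file paths are placeholders, and the exact defragment flags should be checked against the installed btrfs-progs.

```shell
# The mount option affects only blocks written from now on.
mount -o remount,compress=lzo /mnt
mount -o remount,compress=no /mnt

# Per-file overrides beat the mount option: +c requests compression;
# +C marks a file no-CoW, which also disables compression (and only
# takes effect on new or empty files).
chattr +c /mnt/textfile
chattr +C /mnt/vm-image.raw

# Rewrite existing extents so they match the desired state:
btrfs filesystem defragment -r -clzo /mnt   # recompress existing data
btrfs filesystem defragment -r /mnt         # rewrite without forcing compression
```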
* Re: checksum error in metadata node - best way to move root fs to new drive?
From: Austin S. Hemmelgarn @ 2016-08-12 12:04 UTC (permalink / raw)
To: Dave T, Duncan; +Cc: Nicholas D Steeves, Chris Murphy, Btrfs BTRFS

On 2016-08-11 16:23, Dave T wrote:
> What I have gathered so far is the following:
>
> 1. my RAM is not faulty and I feel comfortable ruling out a memory
> error as having anything to do with the reported problem.
>
> 2. my storage device does not seem to be faulty. I have not figured
> out how to do more definitive testing, but smartctl reports it as
> healthy.

Is this just based on smartctl -H, or is it based on looking at all the
info available from smartctl?  Based on everything you've said so far,
it sounds to me like there was a group of uncorrectable errors on the
disk, and the sectors in question have now been remapped by the
device's firmware.  Such a situation is actually more common than
people think (this is part of the whole 'reinstall to speed up your
system' mentality in the Windows world).  I've actually had this happen
before (and correlated the occurrences with spikes in readings from the
data-logging Geiger counter I have next to my home server).

Most disks don't start to report as failing until they get into pretty
bad condition: on most hard drives, it takes an insanely large count of
reallocated sectors to mark the disk as failed in the drive firmware,
and on SSD's you pretty much have to run them out of spare blocks,
which takes a _long_ time on many SSD's.

> 3. this problem first happened on a normally running system in light
> use. It had not recently crashed. But the root fs went read-only for
> an unknown reason.
>
> 4. the aftermath of the initial problem may have been exacerbated by
> hard resetting the system, but that's only a guess
>
>> The compression-related problem is this: Btrfs is considerably less
>> tolerant of checksum-related errors on btrfs-compressed data
>
> I'm an unsophisticated user. The argument in support of this statement
> sounds convincing to me. Therefore, I think I should discontinue using
> compression. Anyone disagree?
>
> Is there anything else I should change? (Do I need to provide
> additional information?)
>
> What can I do to find out more about what caused the initial problem?
> I have heard memory errors mentioned, but that's apparently not the
> case here. I have heard crash recovery mentioned, but that isn't how
> my problem initially happened.
>
> I also have a few general questions:
>
> 1. Can one discontinue using the compress mount option if it has been
> used previously? What happens to existing data if the compress mount
> option is 1) added when it wasn't used before, or 2) dropped when it
> had been used?

Yes, it just affects newly written data.  If you want to convert
existing data to be uncompressed, you'll need to run 'btrfs filesystem
defrag -r' on the filesystem to convert things.

> 2. I understand that the compress option generally improves btrfs
> performance (via Phoronix article I read in the past; I don't find the
> link). Since encryption has some characteristics in common with
> compression, would one expect any decrease in performance from
> dropping compression when using btrfs on dm-crypt? (For more context,
> with an i7 6700K which has aes-ni, CPU performance should not be a
> bottleneck on my computer.)

I would expect a change in performance in that case, but not
necessarily a decrease.  The biggest advantage of compression is that
it trades time spent using the disk for time spent using the CPU.  In
many cases, this is a favorable trade-off when your storage is slower
than your memory (because memory speed is really the big limiting
factor here, not processor speed).  In your case, the encryption is
hardware accelerated, but the compression isn't, so you should in
theory actually get better performance by turning off compression.

> 3. How do I find out if it is appropriate to use dup metadata on a
> Samsung 950 Pro NVMe drive? I don't see deduplication mentioned in the
> drive's datasheet:
> http://www.samsung.com/semiconductor/minisite/ssd/downloads/document/Samsung_SSD_950_PRO_Data_Sheet_Rev_1_2.pdf

Whether or not it does deduplication is hard to answer.  If it does,
then you obviously should avoid dup metadata.  If it doesn't, then it's
a complex question as to whether or not to use dup metadata.

The short explanation for why is that the SSD firmware maintains a
somewhat arbitrary mapping between LBA's and the actual location of the
data in flash, and it tends to group writes from around the same time
together in the flash itself.  The argument against dup on SSD's in
general takes this into account, arguing that because the data is
likely to be in the same erase block for both copies, it's not as well
protected.  Personally, I run dup on non-deduplicating SSD's anyway,
because I don't trust higher layers not to potentially mess up one of
the copies, and I still get better performance than most hard disks.

> 4. Given that my drive is not reporting problems, does it seem
> reasonable to re-use this drive after the errors I reported? If so,
> how should I do that? Can I simply make a new btrfs filesystem and
> copy my data back? Should I start at a lower level and re-do the
> dm-crypt layer?

If it were me, I'd rebuild from the ground up just to be sure that
everything is in a known working state.  That way you can be reasonably
sure any issues are not left over from the previous configuration.

> 5. Would most of you guys use btrfs + dm-crypt on a production file
> server (with spinning disks in JBOD configuration -- i.e., no RAID)?
> In this situation, the data is very important, of course. My past
> experience indicated that RAID only improves uptime, which is not so
> critical in our environment. Our main criterion is that we should
> never ever have data loss. As far as I understand it, we do have to
> use encryption.

On a file server?  No, I'd ensure proper physical security is
established, make sure it's properly secured against network based
attacks, and then not worry about it.  Unless you have things you want
to hide from law enforcement or your government (which may or may not
be legal where you live) or can reasonably expect someone to steal the
system, you almost certainly don't actually need whole disk encryption.
There are two specific exceptions to this though:
1. If your employer requires encryption on this system, that's their
call.
2. Encrypted swap is a good thing regardless, because it prevents
security credentials from accidentally being written unencrypted to
persistent storage.

On my personal systems, I only use encryption for swap space and
security credentials, and I use file based encryption for the
credentials.  I also don't store any data that needs absolute
protection against people stealing it (other than the security
credentials, but I can remotely deauthorize any of those with minimal
effort), so there's not much advantage for me as a user in using disk
encryption.  Things are pretty similar at work, except the reasoning
there is that we have good network protection and restricted access to
the server room, so there's no realistic way the data could be stolen
without causing significant damage elsewhere (although we're in a small
enough industry that the only people likely to want to steal our data
are our competitors, and they don't have the funding to pull off
industrial espionage).

Now, as far as RAID, I don't entirely agree about it just improving
up-time.  That's one of the big advantages, but it's not the only one.
Having a system that will survive a disk failure and keep working is
good for other reasons too:
1. It makes it less immediately critical that things be dealt with (for
example, if a disk fails in the middle of the night, you can often wait
until the next morning to deal with it).
2. When done right, with a system that supports hot-swap properly (all
server systems these days should), it allows for much simpler and much
safer storage device upgrades.
3. It makes it easier (when done with BTRFS or LVM) to re-provision
storage space without having to take the system off-line.

I could have almost any of the Linux servers at work back up and
running correctly from a backup in about 15 minutes, but I still have
them set up with RAID-1 because it lets me do things like install
bigger storage devices with minimal chance of data loss.  As for my
personal systems, my home server is set up with RAID in such a way that
I can lose 3 of the 4 hard drives and 1 of the 2 SSD's and still not
need to restore from backup (and still have a working system).
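Austin's distinction between smartctl -H and the full attribute dump, and the dup-metadata option, translate into commands roughly like these. This is a sketch; the device names (/dev/sda, /dev/nvme0, /dev/mapper/cryptroot) are placeholders.

```shell
# -H prints only the firmware's overall pass/fail verdict.  The raw
# attributes tell the real story; watch the reallocated and pending
# sector counts in particular:
smartctl -a /dev/sda | grep -Ei 'realloc|pending|uncorrect'

# NVMe drives such as the 950 Pro expose media errors and remaining
# spare capacity through a separate health log:
smartctl -a /dev/nvme0

# Duplicated metadata with single data on one (non-deduplicating)
# device:
mkfs.btrfs -m dup -d single /dev/mapper/cryptroot
```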
* Re: checksum error in metadata node - best way to move root fs to new drive?
From: Duncan @ 2016-08-12 15:06 UTC (permalink / raw)
To: linux-btrfs

Austin S. Hemmelgarn posted on Fri, 12 Aug 2016 08:04:42 -0400 as
excerpted:

> On a file server? No, I'd ensure proper physical security is
> established and make sure it's properly secured against network based
> attacks and then not worry about it. Unless you have things you want
> to hide from law enforcement or your government (which may or may not
> be legal where you live) or can reasonably expect someone to steal the
> system, you almost certainly don't actually need whole disk
> encryption.  There are two specific exceptions to this though:
> 1. If your employer requires encryption on this system, that's their
> call.
> 2. Encrypted swap is a good thing regardless, because it prevents
> security credentials from accidentally being written unencrypted to
> persistent storage.

In the US, medical records are pretty well protected under penalty of
law (HIPAA, IIRC?).  Anyone storing medical records here would do well
to have full filesystem encryption for that reason.

Of course financial records are sensitive as well, or even just forum
login information, and then there's the various industrial spies from
various countries (China being the one most frequently named) that
would pay good money for unencrypted devices from the right sources.

-- 
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
* Re: checksum error in metadata node - best way to move root fs to new drive?
From: Austin S. Hemmelgarn @ 2016-08-15 11:33 UTC (permalink / raw)
To: linux-btrfs

On 2016-08-12 11:06, Duncan wrote:
> Austin S. Hemmelgarn posted on Fri, 12 Aug 2016 08:04:42 -0400 as
> excerpted:
>
>> On a file server? No, I'd ensure proper physical security is
>> established and make sure it's properly secured against network based
>> attacks and then not worry about it. Unless you have things you want
>> to hide from law enforcement or your government (which may or may not
>> be legal where you live) or can reasonably expect someone to steal
>> the system, you almost certainly don't actually need whole disk
>> encryption.  There are two specific exceptions to this though:
>> 1. If your employer requires encryption on this system, that's their
>> call.
>> 2. Encrypted swap is a good thing regardless, because it prevents
>> security credentials from accidentally being written unencrypted to
>> persistent storage.
>
> In the US, medical records are pretty well protected under penalty of
> law (HIPAA, IIRC?). Anyone storing medical records here would do well
> to have full filesystem encryption for that reason.
>
> Of course financial records are sensitive as well, or even just forum
> login information, and then there's the various industrial spies from
> various countries (China being the one most frequently named) that
> would pay good money for unencrypted devices from the right sources.

Medical and even financial records really fall under my first
exception, but encryption is still no substitute for proper physical
security.

As far as user account information, that depends on what your legal or
PR department promised, but in many cases there's minimal improvement
in security when using full disk encryption in place of just encrypting
the database file used to store the information.  In either case
though, it's still a better investment in terms of both time and money
to properly secure the network and physical access to the hardware.
All that disk encryption protects is data at rest, and for a _server_
system the data is almost always online, so lack of protection of the
system as a whole is usually more of a security issue in general than
lack of protection for a single disk that's powered off.
* Re: checksum error in metadata node - best way to move root fs to new drive?
From: Chris Murphy @ 2016-08-12 17:02 UTC (permalink / raw)
To: Btrfs BTRFS

On Fri, Aug 12, 2016 at 6:04 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-08-11 16:23, Dave T wrote:
>> 5. Would most of you guys use btrfs + dm-crypt on a production file
>> server (with spinning disks in JBOD configuration -- i.e., no RAID)?
>> In this situation, the data is very important, of course. My past
>> experience indicated that RAID only improves uptime, which is not so
>> critical in our environment. Our main criterion is that we should
>> never ever have data loss. As far as I understand it, we do have to
>> use encryption.
>
> On a file server? No, I'd ensure proper physical security is
> established and make sure it's properly secured against network based
> attacks and then not worry about it. Unless you have things you want
> to hide from law enforcement or your government (which may or may not
> be legal where you live) or can reasonably expect someone to steal the
> system, you almost certainly don't actually need whole disk
> encryption.

Sure, but then you need a fairly strict handling policy for those
drives when they leave the environment: e.g. for an RMA if the drive
dies under warranty, or when the drive is being retired.  First there's
the actual physical handling (even interception) and accounting of all
of the drives, which has to be rather strict.  And second, since a dead
drive can't be wiped, the fallback must be physical destruction.

For any data not worth physically destroying the drive over for proper
disposal, you can probably forego full disk encryption.

-- 
Chris Murphy
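One practical corollary for the encrypted case, offered here as a hedged aside rather than something from the thread: with LUKS, destroying the keyslots is generally enough to make a retired (still-working) drive unreadable, which softens the handling problem for everything short of a dead drive. The device name below is a placeholder, and this is only safe if no LUKS header backup exists elsewhere.

```shell
# Destroy all LUKS keyslots; without the high-entropy master key, the
# encrypted data is computationally unrecoverable.
cryptsetup luksErase /dev/sdX

# Belt and braces: also overwrite the header region itself.  Header
# size varies; 16 MiB comfortably covers common LUKS1/LUKS2 layouts.
dd if=/dev/urandom of=/dev/sdX bs=1M count=16
```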
* checksum error in metadata node - best way to move root fs to new drive?
From: Dave T @ 2016-08-10  3:27 UTC (permalink / raw)
To: linux-btrfs

btrfs scrub returned with uncorrectable errors. Searching in dmesg
returns the following information:

    BTRFS warning (device dm-0): checksum error at logical NNNNN on
    /dev/mapper/[crypto] sector: yyyyy metadata node (level 2) in tree 250

it also says:

    unable to fixup (regular) error at logical NNNNNN on
    /dev/mapper/[crypto]

I assume I have a bad block device. Does that seem correct? The
important data is backed up.

However, it would save me a lot of time reinstalling the operating
system and setting up my work environment if I can copy this root
filesystem to another storage device.

Can I do that, considering the errors I have mentioned? With the
uncorrectable error being in a metadata node, what (if anything) does
that imply about restoring from this drive?

If I can copy this entire root filesystem, what is the best way to do
it? The btrfs restore tool? cp? rsync? Some cloning tool? Other
options?

If I use the btrfs restore tool, should I use options x, m and S? In
particular I wonder exactly what the S option does. If I leave S out,
are all symlinks ignored?

I'm trying to save time and clone this so that I get the operating
system and all my tweaks / configurations back. As I said, the really
important data is separately backed up.

I appreciate all suggestions.
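The scrub and restore steps discussed in this thread correspond to commands along these lines. This is a sketch; the mount point, device, and recovery target paths are placeholders.

```shell
# Scrub the mounted filesystem and pull checksum errors from the
# kernel log.  -B waits for completion; -d prints per-device stats.
btrfs scrub start -Bd /mnt
dmesg | grep -i 'checksum error'

# If the filesystem is too damaged to mount, btrfs restore copies
# files out without writing to the source: -x restores extended
# attributes, -m restores ownership/permissions/timestamps, and -S
# restores symlinks (without -S, symlinks are skipped).
btrfs restore -x -m -S /dev/mapper/cryptroot /mnt/recovery
```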
* Re: checksum error in metadata node - best way to move root fs to new drive? 2016-08-10 3:27 Dave T @ 2016-08-10 6:27 ` Duncan 2016-08-10 19:46 ` Austin S. Hemmelgarn 2016-08-10 21:21 ` Chris Murphy 2016-08-10 21:15 ` Chris Murphy 1 sibling, 2 replies; 28+ messages in thread From: Duncan @ 2016-08-10 6:27 UTC (permalink / raw) To: linux-btrfs Dave T posted on Tue, 09 Aug 2016 23:27:56 -0400 as excerpted: > btrfs scrub returned with uncorrectable errors. Searching in dmesg > returns the following information: > > BTRFS warning (device dm-0): checksum error at logical NNNNN on > /dev/mapper/[crypto] sector: yyyyy metadata node (level 2) in tree 250 > > it also says: > > unable to fixup (regular) error at logical NNNNNN on > /dev/mapper/[crypto] > > > I assume I have a bad block device. Does that seem correct? The > important data is backed up. > > However, it would save me a lot of time reinstalling the operating > system and setting up my work environment if I can copy this root > filesystem to another storage device. > > Can I do that, considering the errors I have mentioned?? With the > uncorrectable error being in a metadata node, what (if anything) does > that imply about restoring from this drive? Well, given that I don't see any other people more qualified than I, as a simple btrfs user and list regular, tho not a dmcrypt user and definitely not a btrfs dev, posting, I'll try to help, but... Do you know what data and metadata replication modes you were using? Scrub detects checksum errors, and for raid1 mode on multi-device (but I guess you were single device) and dup mode on single device, it will try the other copy and use it if the checksum passes there, repairing the bad copy as well. But until recently dup mode data on single device was impossible, so I doubt you were using that, and while dup mode metadata was the normal default, on ssd that changes to single mode as well. 
Which means if you were using ssd defaults, you got single mode for both
data and metadata, and scrub can detect but not correct checksum errors.

That doesn't directly answer your question, but it does explain why/that
you couldn't /expect/ scrub to fix checksum problems, only detect them,
if both data and metadata are single mode.

Meanwhile, in a different post you asked about btrfs on dmcrypt. I'm not
aware of any direct btrfs-on-dmcrypt specific bugs (tho I'm just a btrfs
user and list regular, not a dev, so could have missed something), but
certainly, the dmcrypt layer doesn't simplify things. There was a guy
here, Marc MERLIN, who worked for Google I believe and was on the road
frequently, and who was using btrfs on dmcrypt for his laptop and various
btrfs on his servers as well -- he wrote some of the raid56 mode stuff on
the wiki based on his own experiments with it. I'd suggest he'd be the guy
to talk to about btrfs on dmcrypt if you can get in contact with him, as
he seemed to have more experience with it than anyone else around here.
But I haven't seen him around recently...

Put it this way. If it were my data on the line, I'd either (1) use
another filesystem on top of dmcrypt, if I really wanted/needed the
crypted layer, or (2) do without the crypted layer, or (3) use btrfs but
be extra vigilant with backups. This since while I know of no specific
bugs in the btrfs-on-dmcrypt case, I don't particularly trust it either,
and Marc MERLIN's posted troubles with the combo were enough to have me
avoiding it if possible, and being extra careful with backups if not.

> If I can copy this entire root filesystem, what is the best way to do
> it? The btrfs restore tool? cp? rsync? Some cloning tool? Other options?
It depends on if the filesystem is mountable and if so, how much can be retrieved without error, the latter of which depends on the extent of that metadata damage, since damaged metadata will likely take out multiple files, and depending on what level of the tree the damage was on, it could take out only a few files, or most of the filesystem! If you can mount and the damage appears to be limited, I'd try mounting read-only and copying what I could off, using conventional methods. That way you get checksum protection, which should help assure that anything successfully copied isn't corrupted, because btrfs will error out if there's checksum errors and it won't copy successfully. If it won't mount or it will but the damage appears to be extensive, I'd suggest using restore. It's read-only in terms of the filesystem it's restoring from, so shouldn't cause further damage -- unless the device is actively decaying as you use it, in which case the first thing I'd try to do is image it to something else so the damage isn't getting worse as you work with it. But AFAIK restore doesn't give you the checksum protection, so anything restored that way /could/ be corrupt (tho it's worth noting that ordinary filesystems don't do checksum protection anyway, so it's important not to consider the file any more damaged just because it wasn't checksum protected than it would be if you simply retrieved it from say an ext4 filesystem and didn't have some other method to verify the file). Altho... working on dmcrypt, I suppose it's likely that anything that's corrupted turns up entirely scrambled and useless anyway -- you may not be able to retrieve for example a video file with some dropouts as may be the case on unencrypted storage, but have a totally scrambled and useless file, or at least that file block (4K), instead. > If I use the btrfs restore tool, should I use options x, m and S? In > particular I wonder exactly what the S option does. 
> If I leave S out,
> are all symlinks ignored?

Symlinks are not restored without -S, correct. That and -m are both
relatively new restore options -- back when I first used restore you
simply didn't get that back.

If it's primarily just data files and you don't really care about
ownership/permissions or date metadata, you can leave the -m off to
simplify the process slightly. In that case, the files will be written
just as any other new file would be written, as the user (root) the app
is running as, subject to the current umask. Else use the -m and restore
will try to restore ownership/permissions/dates metadata as well.

Similarly, you may or may not need -x for the extended attributes.
Unless you're using selinux and its security attributes, or capabilities
to avoid running as superuser (and those both apply primarily to
executables), chances are fairly good that unless you specifically know
you need extended attributes restored, you don't, and can skip that
option.

> I'm trying to save time and clone this so that I get the operating
> system and all my tweaks / configurations back. As I said, the really
> important data is separately backed up.

Good. =:^)

Sounds about like me. I do periodic backups, but have run restore a
couple times when a filesystem wouldn't mount, in order to get back as
much of the delta between the last backup and current as possible. Of
course I know not doing more frequent backups is a calculated risk and I
was prepared to have to redo anything changed since the backup if
necessary, but it's nice to have a tool like btrfs restore that can make
it unnecessary under certain conditions where it otherwise would be. =:^)

-- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 28+ messages in thread
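To make those flags concrete, here is a hedged sketch of a restore invocation; the device and target paths are placeholders rather than the poster's actual setup, and option coverage varies by btrfs-progs release, so check btrfs restore --help on your version first:

```shell
# Restore files from an unmountable btrfs onto a healthy filesystem.
# /dev/mapper/cryptroot and /mnt/rescue are example paths.
#
# -x : restore extended attributes
# -m : restore ownership/permissions/timestamp metadata
# -S : restore symlinks (they are skipped entirely without this flag)
btrfs restore -x -m -S /dev/mapper/cryptroot /mnt/rescue

# If the default tree root is too damaged, list candidate roots first
# and retry with one of the reported bytenr values via -t:
btrfs-find-root /dev/mapper/cryptroot
btrfs restore -x -m -S -t <bytenr> /dev/mapper/cryptroot /mnt/rescue
```

Restore never writes to the source device, so it is safe to iterate over -t candidates.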
* Re: checksum error in metadata node - best way to move root fs to new drive? 2016-08-10 6:27 ` Duncan @ 2016-08-10 19:46 ` Austin S. Hemmelgarn 2016-08-10 21:21 ` Chris Murphy 1 sibling, 0 replies; 28+ messages in thread From: Austin S. Hemmelgarn @ 2016-08-10 19:46 UTC (permalink / raw) To: linux-btrfs On 2016-08-10 02:27, Duncan wrote: > Dave T posted on Tue, 09 Aug 2016 23:27:56 -0400 as excerpted: > >> btrfs scrub returned with uncorrectable errors. Searching in dmesg >> returns the following information: >> >> BTRFS warning (device dm-0): checksum error at logical NNNNN on >> /dev/mapper/[crypto] sector: yyyyy metadata node (level 2) in tree 250 >> >> it also says: >> >> unable to fixup (regular) error at logical NNNNNN on >> /dev/mapper/[crypto] >> >> >> I assume I have a bad block device. Does that seem correct? The >> important data is backed up. >> >> However, it would save me a lot of time reinstalling the operating >> system and setting up my work environment if I can copy this root >> filesystem to another storage device. >> >> Can I do that, considering the errors I have mentioned?? With the >> uncorrectable error being in a metadata node, what (if anything) does >> that imply about restoring from this drive? > > Well, given that I don't see any other people more qualified than I, as a > simple btrfs user and list regular, tho not a dmcrypt user and definitely > not a btrfs dev, posting, I'll try to help, but... I probably would have replied, if I had seen the e-mail before now. GMail apparently really hates me recently, as I keep getting things hours to days after other people and regularly out of order... As usual though, you seem to have already covered everything important pretty well, I've only got a few comments to add below. > > Do you know what data and metadata replication modes you were using? 
> Scrub detects checksum errors, and for raid1 mode on multi-device (but I > guess you were single device) and dup mode on single device, it will try > the other copy and use it if the checksum passes there, repairing the bad > copy as well. > > But until recently dup mode data on single device was impossible, so I > doubt you were using that, and while dup mode metadata was the normal > default, on ssd that changes to single mode as well. > > Which means if you were using ssd defaults, you got single mode for both > data and metadata, and scrub can detect but not correct checksum errors. > > That doesn't directly answer your question, but it does explain why/that > you couldn't /expect/ scrub to fix checksum problems, only detect them, > if both data and metadata are single mode. > > Meanwhile, in a different post you asked about btrfs on dmcrypt. I'm not > aware of any direct btrfs-on-dmcrypt specific bugs (tho I'm just a btrfs > user and list regular, not a dev, so could have missed something), but > certainly, the dmcrypt layer doesn't simplify things. There was a guy > here, Mark MERLIN, worked for google I believe and was on the road > frequently, that was using btrfs on dmcrypt for his laptop and various > btrfs on his servers as well -- he wrote some of the raid56 mode stuff on > the wiki based on his own experiments with it. But I haven't seen him > around recently. I'd suggest he'd be the guy to talk to about btrfs on > dmcrypt if you can get in contact with him, as he seemed to have more > experience with it than anyone else around here. But like I said I > haven't seen him around recently... > > Put it this way. If it were my data on the line, I'd either (1) use > another filesystem on top of dmcrypt, if I really wanted/needed the > crypted layer, or (2) do without the crypted layer, or (3) use btrfs but > be extra vigilant with backups. 
> This since while I know of no specific
> bugs in btrfs-on-dmcrypt case, I don't particularly trust it either, and
> Marc MERLIN's posted troubles with the combo were enough to have me
> avoiding it if possible, and being extra careful with backups if not.

As far as dm-crypt goes, it looks like BTRFS is stable on top in the
configuration I use (aes-xts-plain64 with a long key using plain dm-crypt
instead of LUKS). I have heard rumors of issues when using LUKS without
hardware acceleration, but I've never seen any conclusive proof, and what
little I've heard sounds more like it was just race conditions elsewhere
causing the issues.

>
>> If I can copy this entire root filesystem, what is the best way to do
>> it? The btrfs restore tool? cp? rsync? Some cloning tool? Other options?
>
> It depends on if the filesystem is mountable and if so, how much can be
> retrieved without error, the latter of which depends on the extent of
> that metadata damage, since damaged metadata will likely take out
> multiple files, and depending on what level of the tree the damage was
> on, it could take out only a few files, or most of the filesystem!
>
> If you can mount and the damage appears to be limited, I'd try mounting
> read-only and copying what I could off, using conventional methods. That
> way you get checksum protection, which should help assure that anything
> successfully copied isn't corrupted, because btrfs will error out if
> there's checksum errors and it won't copy successfully.
>
> If it won't mount or it will but the damage appears to be extensive, I'd
> suggest using restore. It's read-only in terms of the filesystem it's
> restoring from, so shouldn't cause further damage -- unless the device is
> actively decaying as you use it, in which case the first thing I'd try to
> do is image it to something else so the damage isn't getting worse as you
> work with it.
> > But AFAIK restore doesn't give you the checksum protection, so anything > restored that way /could/ be corrupt (tho it's worth noting that ordinary > filesystems don't do checksum protection anyway, so it's important not to > consider the file any more damaged just because it wasn't checksum > protected than it would be if you simply retrieved it from say an ext4 > filesystem and didn't have some other method to verify the file). > > Altho... working on dmcrypt, I suppose it's likely that anything that's > corrupted turns up entirely scrambled and useless anyway -- you may not > be able to retrieve for example a video file with some dropouts as may be > the case on unencrypted storage, but have a totally scrambled and useless > file, or at least that file block (4K), instead. This may or may not be the case, it really depends on how dm-crypt is set up, and a bunch of other factors. The chance of this happening is higher with dm-crypt, but it's still not a certainty. > >> If I use the btrfs restore tool, should I use options x, m and S? In >> particular I wonder exactly what the S option does. If I leave S out, >> are all symlinks ignored? > > Symlinks are not restored without -S, correct. That and -m are both > relatively new restore options -- back when I first used restore you > simply didn't get that back. > > If it's primarily just data files and you don't really care about > ownership/permissions or date metadata, you can leave the -m off to > simplify the process slightly. In that case, the files will be written > just as any other new file would be written, as the user (root) the app > is running as, subject to the current umask. Else use the -m and restore > will try to restore ownership/permissions/dates metadata as well. > > Similarly, you may or may not need -x for the extended attributes. 
> Unless you're using selinux and its security attributes, or capacities to > avoid running as superuser (and those both apply primarily to > executables), chances are fairly good that unless you specifically know > you need extended attributes restored, you don't, and can skip that > option. There are a few other cases where they are important, but most of them are big data-center type things. The big one I can think of off the top of my head is when using GlusterFS on top of BTRFS, as Gluster stores synchronization info in xattrs. I'm pretty certain Ceph does too. In general though, if it's just a workstation, you probably don't need xattrs unless you use a security module (like SELinux, IMA, or EVM), file capabilities (ping almost certainly does on your system, but I doubt anything else does, and ping won't break without them), or are using ACL's (or Samba, it stores Windows style ACE's in xattrs, but it doesn't do so by default, and setting that up right is complicated). If you can afford to wait a bit longer, it's probably better to use -x, because most of the things that break in the face of missing xattrs tend to break rather spectacularly. > >> I'm trying to save time and clone this so that I get the operating >> system and all my tweaks / configurations back. As I said, the really >> important data is separately backed up. > > Good. =:^) > > Sounds about like me. I do periodic backups, but have run restore a > couple times when a filesystem wouldn't mount, in ordered to get back as > much of the delta between the last backup and current as possible. Of > course I know not doing more frequent backups is a calculated risk and I > was prepared to have to redo anything changed since the backup if > necessary, but it's nice to have a tool like btrfs restore that can make > it unnecessary under certain conditions where it otherwise would be. =:^) > ^ permalink raw reply [flat|nested] 28+ messages in thread
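As a practical way to check whether a given system actually depends on xattrs before deciding on -x, something like the following works; these are the standard libcap/attr/acl tools, and the paths are illustrative examples:

```shell
# Find binaries carrying file capabilities (e.g. ping's cap_net_raw):
getcap -r /usr/bin 2>/dev/null

# Dump all extended attributes on one file, across every namespace
# (security.*, system.*, user.*):
getfattr -d -m - /usr/bin/ping

# List only files whose ACLs go beyond the classic mode bits
# (-s skips trivial entries, -p keeps absolute paths):
getfacl -R -s -p /srv/samba 2>/dev/null | head
```

If all three come back empty, restoring without -x is unlikely to break anything on that box.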
* Re: checksum error in metadata node - best way to move root fs to new drive? 2016-08-10 6:27 ` Duncan 2016-08-10 19:46 ` Austin S. Hemmelgarn @ 2016-08-10 21:21 ` Chris Murphy 2016-08-10 22:01 ` Dave T 2016-08-12 17:00 ` Patrik Lundquist 1 sibling, 2 replies; 28+ messages in thread From: Chris Murphy @ 2016-08-10 21:21 UTC (permalink / raw) Cc: Btrfs BTRFS I'm using LUKS, aes xts-plain64, on six devices. One is using mixed-bg single device. One is dsingle mdup. And then 2x2 mraid1 draid1. I've had zero problems. The two computers these run on do have aesni support. Aging wise, they're all at least a year old. But I've been using Btrfs on LUKS for much longer than that. Chris Murphy ^ permalink raw reply [flat|nested] 28+ messages in thread
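For reference, a setup along the lines Chris describes might be created as follows; the device name and key size are illustrative assumptions, not his exact commands:

```shell
# Format a partition as LUKS using AES in XTS mode with plain64 IVs:
cryptsetup luksFormat --cipher aes-xts-plain64 --key-size 512 /dev/sdb2

# Open it and put btrfs on the mapped device:
cryptsetup open /dev/sdb2 cryptroot
mkfs.btrfs -L top_level /dev/mapper/cryptroot

# Confirm AES-NI is present and see what throughput the cipher gets,
# so the crypto layer isn't silently a bottleneck:
grep -m1 -o aes /proc/cpuinfo
cryptsetup benchmark --cipher aes-xts-plain64
```

With XTS, the effective AES key is half the --key-size value, which is why 512 is the common choice for AES-256.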
* Re: checksum error in metadata node - best way to move root fs to new drive?
2016-08-10 21:21 ` Chris Murphy
@ 2016-08-10 22:01 ` Dave T
2016-08-10 22:23 ` Chris Murphy
2016-08-11 4:50 ` Duncan
2016-08-12 17:00 ` Patrik Lundquist
1 sibling, 2 replies; 28+ messages in thread
From: Dave T @ 2016-08-10 22:01 UTC (permalink / raw)
To: Chris Murphy, Duncan, ahferroin7; +Cc: Btrfs BTRFS

Thanks for all the responses, guys! I really appreciate it. This
information is very helpful. I will be working through the suggestions
(e.g., check without repair) for the next hour or so. I'll report back
when I have something to report.

My drive is a Samsung 950 Pro nvme drive, which in most respects is
treated like an SSD. (The only difference I am aware of is that trim
isn't needed.)

> But until recently dup mode data on single device was impossible, so I
> doubt you were using that, and while dup mode metadata was the normal
> default, on ssd that changes to single mode as well.

Your assumptions are correct: single mode for data and metadata. Does
anyone have any thoughts about using dup mode for metadata on a Samsung
950 Pro (or any NVMe drive)?

I will be very disappointed if I cannot use btrfs + dm-crypt. As far
as I can see, there is no alternative given that I need to use
snapshots (and LVM, as good as it is, has severe performance penalties
for its snapshots). I'm required to use crypto. I cannot risk doing
without snapshots. Therefore, btrfs + dm-crypt seems like my only
viable solution. Plus it is my preferred solution. I like both tools.
If all goes well, we are planning to implement a production file server
for our office with dm-crypt + btrfs (and a lot of spinning disks).

In the office we currently have another system identical to mine
running the same drive with dm-crypt + btrfs, the same operating
system, the same nvidia GPU and proprietary driver, and it is running
fine. One difference is that it is overclocked substantially (mine
isn't).
I would have expected it would give a problem before mine would. But it
seems to be rock solid. I just ran btrfs scrub on it and it finished in
a few seconds with no errors.

On my computer I have run two extensive memory tests (8 cpu cores in
parallel, all tests). The current test has been running for 14 hrs with
no errors. (I think that 8 cores in parallel make this equivalent to a
much longer test with the default single cpu settings.) Therefore, I do
not believe this issue is caused by RAM.

I'm hoping there is no configuration error or other mistake I made in
setting these systems up that would lead to the problems I'm
experiencing.

BTW, I was able to copy all the files to another drive with no problem.
I used "cp -a" to copy, then I ran "rsync -a" twice to make sure
nothing was missed. My guess is that I'll be able to copy this right
back onto the root filesystem after I resolve whatever the problem is
and my operating system will be back to the same state it was in prior
to this problem.

OK, I'm off to try btrfs check without --repair... thanks again!

For reference:
btrfs-progs v4.6.1
Linux 4.6.4-1-ARCH #1 SMP PREEMPT Mon Jul 11 19:12:32 CEST 2016 x86_64 GNU/Linux

On Wed, Aug 10, 2016 at 5:21 PM, Chris Murphy <lists@colorremedies.com> wrote:
> I'm using LUKS, aes xts-plain64, on six devices. One is using mixed-bg
> single device. One is dsingle mdup. And then 2x2 mraid1 draid1. I've
> had zero problems. The two computers these run on do have aesni
> support. Aging wise, they're all at least a year old. But I've been
> using Btrfs on LUKS for much longer than that.
>
>
> Chris Murphy
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply [flat|nested] 28+ messages in thread
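On the dup-metadata question above: an existing single-metadata filesystem can be converted online with balance. A sketch, with the mount point as an example:

```shell
# Show the current data/metadata block-group profiles:
btrfs filesystem df /

# Convert metadata chunks from single to dup on a single-device fs
# (runs online; only metadata is rewritten, so it's usually quick):
btrfs balance start -mconvert=dup /

# Alternatively, force dup metadata at creation time, overriding the
# SSD-era default of single:
mkfs.btrfs -m dup /dev/mapper/cryptroot
```

Whether dup actually helps on flash that deduplicates or remaps writes internally is debated, which is presumably why mkfs defaulted to single on SSDs in the first place; it does at least give scrub a second copy to repair from.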
* Re: checksum error in metadata node - best way to move root fs to new drive?
2016-08-10 22:01 ` Dave T
@ 2016-08-10 22:23 ` Chris Murphy
2016-08-10 22:52 ` Dave T
2016-08-11 7:18 ` Andrei Borzenkov
1 sibling, 2 replies; 28+ messages in thread
From: Chris Murphy @ 2016-08-10 22:23 UTC (permalink / raw)
To: Dave T; +Cc: Chris Murphy, Duncan, Austin Hemmelgarn, Btrfs BTRFS

On Wed, Aug 10, 2016 at 4:01 PM, Dave T <davestechshop@gmail.com> wrote:
> I will be very disappointed if I cannot use btrfs + dm-crypt. As far
> as I can see, there is no alternative given that I need to use
> snapshots (and LVM, as good as it is, has severe performance penalties
> for its snapshots).

See LVM thin provisioning snapshots. I haven't benchmarked it, but it's
a night and day difference from conventional (thick) snapshots. The
gotchas are currently there's no raid support, and the snapshots are
whole volume. So each snapshot appears as a volume with the same UUID as
the original, and by default they're not active. So for me it's a bit of
a head scratcher what happens when mounting a snapshot concurrent with
another. For Btrfs this ends badly. For XFS it refuses unless using
nouuid, but still seems capable of writing to the two volumes without
causing problems.

But yes, I like Btrfs snapshots and reflinks better. *shrug*

If you find a Btrfs on dmcrypt problem, it's a serious bug, and I think
it would get attention very quickly.

-- Chris Murphy ^ permalink raw reply [flat|nested] 28+ messages in thread
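The thin-provisioned snapshots Chris mentions look roughly like the following; the volume group name and sizes are made-up examples:

```shell
# Carve a thin pool out of an existing volume group:
lvcreate --type thin-pool -L 100G -n pool0 vg0

# Create a thin volume backed by the pool:
lvcreate --type thin -V 50G --thinpool pool0 -n root vg0

# Snapshotting a thin volume is a cheap metadata operation --
# no size argument needed, unlike thick snapshots:
lvcreate --snapshot --name root-snap vg0/root

# Thin snapshots carry the "skip activation" flag by default;
# -K overrides it when you actually want to mount one:
lvchange -K -ay vg0/root-snap
```

This is the source of the duplicate-UUID gotcha he describes: activating the snapshot exposes a second block device containing a byte-identical filesystem.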
* Re: checksum error in metadata node - best way to move root fs to new drive? 2016-08-10 22:23 ` Chris Murphy @ 2016-08-10 22:52 ` Dave T 2016-08-11 14:12 ` Nicholas D Steeves 2016-08-11 7:18 ` Andrei Borzenkov 1 sibling, 1 reply; 28+ messages in thread From: Dave T @ 2016-08-10 22:52 UTC (permalink / raw) To: Chris Murphy; +Cc: Duncan, Austin Hemmelgarn, Btrfs BTRFS Apologies. I have to make a correction to the message I just sent. Disregard that message and use this one: On Wed, Aug 10, 2016 at 5:15 PM, Chris Murphy <lists@colorremedies.com> wrote: > 1. Report 'btrfs check' without --repair, let's see what it complains > about and if it might be able to plausibly fix this. First, a small part of the dmesg output: [ 172.772283] Btrfs loaded [ 172.772632] BTRFS: device label top_level devid 1 transid 103495 /dev/dm-0 [ 274.320762] BTRFS info (device dm-0): use lzo compression [ 274.320764] BTRFS info (device dm-0): disk space caching is enabled [ 274.320764] BTRFS: has skinny extents [ 274.322555] BTRFS info (device dm-0): bdev /dev/mapper/cryptroot errs: wr 0, rd 0, flush 0, corrupt 2, gen 0 [ 274.329965] BTRFS: detected SSD devices, enabling SSD mode Now, full output of btrfs check without repair option. 
checking extents
bad metadata [292414541824, 292414558208) crossing stripe boundary
bad metadata [292414607360, 292414623744) crossing stripe boundary
bad metadata [292414672896, 292414689280) crossing stripe boundary
bad metadata [292414738432, 292414754816) crossing stripe boundary
bad metadata [292415787008, 292415803392) crossing stripe boundary
bad metadata [292415918080, 292415934464) crossing stripe boundary
bad metadata [292416376832, 292416393216) crossing stripe boundary
bad metadata [292418015232, 292418031616) crossing stripe boundary
bad metadata [292419325952, 292419342336) crossing stripe boundary
bad metadata [292419588096, 292419604480) crossing stripe boundary
bad metadata [292419915776, 292419932160) crossing stripe boundary
bad metadata [292422930432, 292422946816) crossing stripe boundary
bad metadata [292423061504, 292423077888) crossing stripe boundary
ref mismatch on [292423155712 16384] extent item 1, found 0
Backref 292423155712 root 258 not referenced back 0x2280a20
Incorrect global backref count on 292423155712 found 1 wanted 0
backpointer mismatch on [292423155712 16384]
owner ref check failed [292423155712 16384]
bad metadata [292423192576, 292423208960) crossing stripe boundary
bad metadata [292423323648, 292423340032) crossing stripe boundary
bad metadata [292429549568, 292429565952) crossing stripe boundary
bad metadata [292439904256, 292439920640) crossing stripe boundary
bad metadata [292440297472, 292440313856) crossing stripe boundary
bad metadata [292442525696, 292442542080) crossing stripe boundary
bad metadata [292443770880, 292443787264) crossing stripe boundary
bad metadata [292443967488, 292443983872) crossing stripe boundary
bad metadata [292444033024, 292444049408) crossing stripe boundary
bad metadata [292444098560, 292444114944) crossing stripe boundary
bad metadata [292444164096, 292444180480) crossing stripe boundary
bad metadata [292444229632, 292444246016) crossing stripe boundary
bad metadata [292444688384, 292444704768) crossing stripe boundary
bad metadata [292444884992, 292444901376) crossing stripe boundary
bad metadata [292445081600, 292445097984) crossing stripe boundary
bad metadata [292446720000, 292446736384) crossing stripe boundary
bad metadata [292448948224, 292448964608) crossing stripe boundary
Error: could not find btree root extent for root 258
Checking filesystem on /dev/mapper/cryptroot
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: checksum error in metadata node - best way to move root fs to new drive?
2016-08-10 22:52 ` Dave T
@ 2016-08-11 14:12 ` Nicholas D Steeves
2016-08-11 14:45 ` Austin S. Hemmelgarn
` (2 more replies)
0 siblings, 3 replies; 28+ messages in thread
From: Nicholas D Steeves @ 2016-08-11 14:12 UTC (permalink / raw)
To: Dave T; +Cc: Chris Murphy, Duncan, Austin Hemmelgarn, Btrfs BTRFS

Why is the combination of dm-crypt|luks+btrfs+compress=lzo overlooked
as a potential cause? Other than the "raid56 ate my data" threads, I've
noticed a bunch of "luks+btrfs+compress=lzo ate my data" threads.

On 10 August 2016 at 15:46, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote:
>
> As far as dm-crypt goes, it looks like BTRFS is stable on top in the
> configuration I use (aes-xts-plain64 with a long key using plain dm-crypt
> instead of LUKS). I have heard rumors of issues when using LUKS without
> hardware acceleration, but I've never seen any conclusive proof, and what
> little I've heard sounds more like it was just race conditions elsewhere
> causing the issues.
>

Austin, I'm very curious if they were also using compress=lzo, because
my informal hypothesis is that the encryption+btrfs+compress=lzo
combination precipitates these issues. Maybe the combo is more likely
to trigger these race conditions? It might also be neat to mine the
archive to see whether these seem to be more likely to occur with fast
SSDs vs slow rotational disks. Do you use compress=lzo?

On 10 August 2016 at 18:52, Dave T <davestechshop@gmail.com> wrote:
> On Wed, Aug 10, 2016 at 5:15 PM, Chris Murphy <lists@colorremedies.com> wrote:
>
>> 1. Report 'btrfs check' without --repair, let's see what it complains
>> about and if it might be able to plausibly fix this.
>
> First, a small part of the dmesg output:
>
> [ 172.772283] Btrfs loaded
> [ 172.772632] BTRFS: device label top_level devid 1 transid 103495 /dev/dm-0
> [ 274.320762] BTRFS info (device dm-0): use lzo compression

Compress=lzo confirmed.
Corruption occurred on an SSD. On 10 August 2016 at 17:21, Chris Murphy <lists@colorremedies.com> wrote: > I'm using LUKS, aes xts-plain64, on six devices. One is using mixed-bg > single device. One is dsingle mdup. And then 2x2 mraid1 draid1. I've > had zero problems. The two computers these run on do have aesni > support. Aging wise, they're all at least a year old. But I've been > using Btrfs on LUKS for much longer than that. > Chris, do you use compress=lzo? SSDs or rotational disks? If a bunch of people are using this combo without issue, I'll drop the informal hypothesis as "just a suspicion informed by sloppy pattern recognition" ;-) Thank you! Nicholas ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: checksum error in metadata node - best way to move root fs to new drive? 2016-08-11 14:12 ` Nicholas D Steeves @ 2016-08-11 14:45 ` Austin S. Hemmelgarn 2016-08-11 19:07 ` Duncan 2016-08-11 20:33 ` Chris Murphy 2 siblings, 0 replies; 28+ messages in thread From: Austin S. Hemmelgarn @ 2016-08-11 14:45 UTC (permalink / raw) To: Nicholas D Steeves, Dave T; +Cc: Chris Murphy, Duncan, Btrfs BTRFS On 2016-08-11 10:12, Nicholas D Steeves wrote: > Why is the combination of dm-crypt|luks+btrfs+compress=lzo as > overlooked as a potential cause? Other than the "raid56 ate my data" > I've noticed a bunch of "luks+btrfs+compress=lzo ate my data" threads. I haven't personally seen one of those in at least a few months. In general, BTRFS is moving fast enough that reports older than a kernel release cycle are generally out of date unless something confirms otherwise, but I do distinctly recall such issues being commonly reported in the past. > > On 10 August 2016 at 15:46, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: >> >> As far as dm-crypt goes, it looks like BTRFS is stable on top in the >> configuration I use (aex-xts-plain64 with a long key using plain dm-crypt >> instead of LUKS). I have heard rumors of issues when using LUKS without >> hardware acceleration, but I've never seen any conclusive proof, and what >> little I've heard sounds more like it was just race conditions elsewhere >> causing the issues. >> > > Austin, I'm very curious if they were also using compress=lzo, because > my informal hypothesis is that the encryption+btrfs+compress=lzo > combination precipitates these issues. Maybe the combo is more likely > to trigger these race conditions? It might also be neat to mine the > archive to see these seem to be more likely to occur with fast SSDs vs > slow rotational disks. Do you use compress=lzo? 
In my case, I've tested on both SSDs (both cheap low-end ones and good
Intel and Crucial ones) and traditional hard drives, with and without
compression (both zlib and lzo), and with a couple of different
encryption algorithms (AES, Blowfish, and Threefish). It's only on plain
dm-crypt, not LUKS, but I doubt that particular point will make much
difference. The last test I did was when the merge window for 4.6
closed, run as part of the regular regression testing I do, and I'll be
doing another one in the near future. I think the last time I saw any
issues with this in my testing was prior to 4.0, but I don't remember
for sure (most of what I care about is comparison to the previous
version, so I don't keep much in the way of records of specific things).
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: checksum error in metadata node - best way to move root fs to new drive? 2016-08-11 14:12 ` Nicholas D Steeves 2016-08-11 14:45 ` Austin S. Hemmelgarn @ 2016-08-11 19:07 ` Duncan 2016-08-11 20:43 ` Chris Murphy 2016-08-11 20:33 ` Chris Murphy 2 siblings, 1 reply; 28+ messages in thread From: Duncan @ 2016-08-11 19:07 UTC (permalink / raw) To: linux-btrfs Nicholas D Steeves posted on Thu, 11 Aug 2016 10:12:04 -0400 as excerpted: > Why is the combination of dm-crypt|luks+btrfs+compress=lzo as overlooked > as a potential cause? Other than the "raid56 ate my data" I've noticed > a bunch of "luks+btrfs+compress=lzo ate my data" threads. My usage is btrfs on physical device (well, on GPT partitions on the physical device), no encryption, and it's mostly raid1 on paired devices, but there's definitely one kink that compress=lzo (and I believe compression in general, including gzip) adds, and it's possible running it on encryption compounds the issue. The compression-related problem is this: Btrfs is considerably less tolerant of checksum-related errors on btrfs-compressed data, and while on uncompressed btrfs raid1 it will recover from the second copy where possible and continue, on files that btrfs has compressed, if there are enough checksum errors, for example in a hard-shutdown situation where one of the raid1 devices had the updates written but it crashed while writing the other, btrfs will crash instead of simply falling back to the good copy. This is known to be specific to compression; uncompressed btrfs recover as intended from the second copy. And it's known to occur only when there's too many checksum errors in a burst -- the filesystem apparently deals correctly with just a few at a time. This problem has been ongoing for years -- I thought it was just the way btrfs worked until someone mentioned that it didn't behave that way without compression -- and it reasonably regularly prevents a smooth reboot here after a crash. 
In my case I have the system btrfs running read-only by default, so it's not damaged. However, /home and /var/log are of course mounted writable, and that's where the problems come in. If I start in (I believe) rescue mode (it's that or emergency, the other won't do the mounts and won't let me do them manually either, as it thinks a dependency is missing), systemd will do the mounts but not start the (permanent) logging or the services that need to routinely write stuff that I have symlinked into /home/var/whatever so they can write with a read-only root and system partition. I can then scrub the mounted home and log partitions to fix the checksum errors due to one device having the update while the other doesn't, and continue booting normally. However, if I try directly booting normally, the system invariably crashes due to too many checksum errors, even when it /should/ simply read the other copy, which is fine as demonstrated by the fact that scrub can use it to fix the errors on the device triggering the checksum errors. This continued to happen with 4.6. I'm on 4.7 now but am not sure I've crashed with it and thus can't say for sure whether the problem is fixed there. However, I doubt it, as the problem has been there apparently since the compression and raid1 features were introduced, and I didn't see anything mentioning a fix for the issue in the patches going by on the list. The problem is most obvious and reproducible in btrfs raid1 mode, since there, one device /can/ be behind the other, and scrub /can/ be demonstrated to fix it so it's obviously a checksum issue, but I'd imagine if enough checksum mismatches happen on a single device in single mode, it would crash as well, and of course then there's no second copy for scrub to fix the bad copy from, so it would simply show up as a btrfs that can mount but with significant corruption issues that will crash the system if an attempt to read the affected blocks reads too many at a time. 
And to whatever possible extent an encryption layer between the physical device and btrfs results in possible additional corruption in the event of a crash or hard shutdown, it could easily compound an already bad situation. Meanwhile, /if/ that does turn out to be the root issue here, then finally fixing the btrfs compression related problem where a large burst of checksum failures crashes the system, even when there provably exists a second valid copy, but where this only happens with compression, should go quite far in stabilizing btrfs on encrypted underlayers. I know I certainly wouldn't object to the problem being fixed. =:^) -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 28+ messages in thread
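[Editor's note: the rescue-mode recovery Duncan describes in the message above could be sketched as a few commands. This is illustrative only; the mount points match his description, and the scrub flags (-B to run in the foreground, -d for per-device statistics) are standard btrfs-progs options.]

```
# After booting into rescue mode, mount the writable filesystems
mount /home
mount /var/log

# Scrub verifies every block against its checksum and, on raid1,
# rewrites bad blocks from the good copy on the other device
btrfs scrub start -B -d /home
btrfs scrub start -B -d /var/log

# Check the error counts before continuing the normal boot
btrfs scrub status /home
btrfs scrub status /var/log
```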
* Re: checksum error in metadata node - best way to move root fs to new drive? 2016-08-11 19:07 ` Duncan @ 2016-08-11 20:43 ` Chris Murphy 2016-08-12 3:11 ` Duncan 0 siblings, 1 reply; 28+ messages in thread From: Chris Murphy @ 2016-08-11 20:43 UTC (permalink / raw) To: Duncan; +Cc: Btrfs BTRFS On Thu, Aug 11, 2016 at 1:07 PM, Duncan <1i5t5.duncan@cox.net> wrote: > The compression-related problem is this: Btrfs is considerably less > tolerant of checksum-related errors on btrfs-compressed data, Why? The data is the data. And why would it matter if it's application compressed data vs Btrfs compressed data? If there's an error, Btrfs is intolerant. I don't see how there's a checksum error that Btrfs tolerates. But also I don't know if the checksum is predicated on compressed data or uncompressed data - does the scrub blindly read compressed data, checksums it, and compares to the previously recorded csum? Or does the scrub read compressed data, decompresses it, checksums it, then compares? And does compression compress metadata? I don't think it does from some of the squashfs testing of the same set of binary files on ext4 vs btrfs uncompressed vs btrfs compressed. The difference is explained by inline data being compressed (which it is), so I don't think the fs itself gets compressed. Chris Murphy -- Chris Murphy ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: checksum error in metadata node - best way to move root fs to new drive? 2016-08-11 20:43 ` Chris Murphy @ 2016-08-12 3:11 ` Duncan 2016-08-12 3:51 ` Chris Murphy 0 siblings, 1 reply; 28+ messages in thread From: Duncan @ 2016-08-12 3:11 UTC (permalink / raw) To: linux-btrfs Chris Murphy posted on Thu, 11 Aug 2016 14:43:56 -0600 as excerpted: > On Thu, Aug 11, 2016 at 1:07 PM, Duncan <1i5t5.duncan@cox.net> wrote: >> The compression-related problem is this: Btrfs is considerably less >> tolerant of checksum-related errors on btrfs-compressed data, > > Why? The data is the data. And why would it matter if it's application > compressed data vs Btrfs compressed data? If there's an error, Btrfs is > intolerant. I don't see how there's a checksum error that Btrfs > tolerates. Apparently, the code path for compressed data is sufficiently different, that when there's a burst of checksum errors, even on raid1 where it should (and does with scrub) get the correct second copy, it will crash the system. This is my experience and that of others, and what I thought was standard btrfs behavior -- I didn't know it was a compression- specific bug since I use compress on all my btrfs, until someone told me. When the btrfs compression option hasn't been used on that filesystem, or presumably when none of that burst of checksum errors is from btrfs- compressed files, it will grab the second copy and use it as it should, and there will be no crash. This is as reported by others, including people who have tested both with and without btrfs-compressed files and found that it only crashed if the files were btrfs-compressed, whereas it worked as expected, fetching the valid second copy, if they weren't btrfs- compressed. I'd assume this is why this particular bug has remained unsquashed for so long. 
The devs are likely testing compression, and bad checksum data repair from the second copy, but they probably aren't testing bad checksum repair on compressed data, so the problem isn't showing up in their tests. Between that and relatively few people running raid1 with the compression option and seeing enough bad shutdowns to be aware of the problem, it has mostly flown under the radar. For a long time I myself thought it was just the way btrfs behaved with bursts of checksum errors, until someone pointed out that it did /not/ behave that way on btrfs that didn't have any compressed files when the checksum errors occurred. > But also I don't know if the checksum is predicated on compressed data > or uncompressed data - does the scrub blindly read compressed data, > checksums it, and compares to the previously recorded csum? Or does the > scrub read compressed data, decompresses it, checksums it, then > compares? And does compression compress metadata? I don't think it does > from some of the squashfs testing of the same set of binary files on > ext4 vs btrfs uncompressed vs btrfs compressed. The difference is > explained by inline data being compressed (which it is), so I don't > think the fs itself gets compressed. As I'm not a coder I can't actually tell you from reading the code, but AFAIK, both the 128 KiB compression block size and the checksum are on the uncompressed data. Compression takes place after checksumming. And I don't believe metadata, whether metadata itself or inline data, is compressed by btrfs' transparent compression. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: checksum error in metadata node - best way to move root fs to new drive? 2016-08-12 3:11 ` Duncan @ 2016-08-12 3:51 ` Chris Murphy 0 siblings, 0 replies; 28+ messages in thread From: Chris Murphy @ 2016-08-12 3:51 UTC (permalink / raw) To: Btrfs BTRFS On Thu, Aug 11, 2016 at 9:11 PM, Duncan <1i5t5.duncan@cox.net> wrote: > Chris Murphy posted on Thu, 11 Aug 2016 14:43:56 -0600 as excerpted: > >> On Thu, Aug 11, 2016 at 1:07 PM, Duncan <1i5t5.duncan@cox.net> wrote: >>> The compression-related problem is this: Btrfs is considerably less >>> tolerant of checksum-related errors on btrfs-compressed data, >> >> Why? The data is the data. And why would it matter if it's application >> compressed data vs Btrfs compressed data? If there's an error, Btrfs is >> intolerant. I don't see how there's a checksum error that Btrfs >> tolerates. > > Apparently, the code path for compressed data is sufficiently different, > that when there's a burst of checksum errors, even on raid1 where it > should (and does with scrub) get the correct second copy, it will crash > the system. Ahh OK, gotcha. > This is my experience and that of others, and what I thought > was standard btrfs behavior -- I didn't know it was a compression- > specific bug since I use compress on all my btrfs, until someone told me. > > When the btrfs compression option hasn't been used on that filesystem, or > presumably when none of that burst of checksum errors is from btrfs- > compressed files, it will grab the second copy and use it as it should, > and there will be no crash. This is as reported by others, including > people who have tested both with and without btrfs-compressed files and > found that it only crashed if the files were btrfs-compressed, whereas it > worked as expected, fetching the valid second copy, if they weren't btrfs- > compressed. OK so something's broken. 
> > As I'm not a coder I can't actually tell you from reading the code, but > AFAIK, both the 128 KiB compression block size and the checksum are on > the uncompressed data. Compression takes place after checksumming. > > And I don't believe metadata, whether metadata itself or inline data, is > compressed by btrfs' transparent compression. Inline data is definitely compressed.

From ls -li:

263 -rw-r-----. 1 root root 3270 Aug 11 21:29 samsung840-256g-hdparm.txt

From btrfs-debug-tree:

	item 84 key (263 INODE_ITEM 0) itemoff 7618 itemsize 160
		inode generation 7 transid 7 size 3270 nbytes 3270 block group 0
		mode 100640 links 1 uid 0 gid 0 rdev 0 flags 0x0
	item 85 key (263 INODE_REF 256) itemoff 7582 itemsize 36
		inode ref index 8 namelen 26 name: samsung840-256g-hdparm.txt
	item 86 key (263 XATTR_ITEM 3817753667) itemoff 7499 itemsize 83
		location key (0 UNKNOWN.0 0) type XATTR namelen 16 datalen 37
		name: security.selinux
		data unconfined_u:object_r:unlabeled_t:s0
	item 87 key (263 EXTENT_DATA 0) itemoff 5860 itemsize 1639
		inline extent data size 1618 ram 3270 compress(zlib)

Curiously though, these same small text files once above a certain size (?) are not compressed if they aren't inline extents.

278 -rw-r-----. 1 root root 11767 Aug 11 21:29 WDCblack-750g-smartctlx_2.txt

	item 48 key (278 INODE_ITEM 0) itemoff 7675 itemsize 160
		inode generation 7 transid 7 size 11767 nbytes 12288 block group 0
		mode 100640 links 1 uid 0 gid 0 rdev 0 flags 0x0
	item 49 key (278 INODE_REF 256) itemoff 7636 itemsize 39
		inode ref index 23 namelen 29 name: WDCblack-750g-smartctlx_2.txt
	item 50 key (278 XATTR_ITEM 3817753667) itemoff 7553 itemsize 83
		location key (0 UNKNOWN.0 0) type XATTR namelen 16 datalen 37
		name: security.selinux
		data unconfined_u:object_r:unlabeled_t:s0
	item 51 key (278 EXTENT_DATA 0) itemoff 7500 itemsize 53
		extent data disk byte 12939264 nr 4096
		extent data offset 0 nr 12288 ram 12288
		extent compression(zlib)

Hrrmm. 
-- Chris Murphy ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: checksum error in metadata node - best way to move root fs to new drive? 2016-08-11 14:12 ` Nicholas D Steeves 2016-08-11 14:45 ` Austin S. Hemmelgarn 2016-08-11 19:07 ` Duncan @ 2016-08-11 20:33 ` Chris Murphy 2 siblings, 0 replies; 28+ messages in thread From: Chris Murphy @ 2016-08-11 20:33 UTC (permalink / raw) To: Nicholas D Steeves Cc: Dave T, Chris Murphy, Duncan, Austin Hemmelgarn, Btrfs BTRFS On Thu, Aug 11, 2016 at 8:12 AM, Nicholas D Steeves <nsteeves@gmail.com> wrote: > > Chris, do you use compress=lzo? SSDs or rotational disks? No compression, SSD and HDD. The stuff I care about is on dmcrypt (LUKS) for some time. Stuff I sorta care about on plain partitions. Stuff I don't care much about are either on LVM LV's (usually thinp), or qcow2. I have used compression for periods measured in months not years, both zlib and lzo, on both SSD and HDD, to no ill effect. But it's true some of the more abrupt and worse damaged file systems did use compress=lzo. Since lzo is faster and compresses only a bit worse than zlib, it may be that more people choose lzo, and that's why, if there's a problem with compression, it turns out to be with lzo: coincidence rather than causation. I'm not even sure there's enough information to establish correlation. -- Chris Murphy ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: checksum error in metadata node - best way to move root fs to new drive? 2016-08-10 22:23 ` Chris Murphy 2016-08-10 22:52 ` Dave T @ 2016-08-11 7:18 ` Andrei Borzenkov 1 sibling, 0 replies; 28+ messages in thread From: Andrei Borzenkov @ 2016-08-11 7:18 UTC (permalink / raw) To: Chris Murphy; +Cc: Dave T, Duncan, Austin Hemmelgarn, Btrfs BTRFS On Thu, Aug 11, 2016 at 1:23 AM, Chris Murphy <lists@colorremedies.com> wrote: > On Wed, Aug 10, 2016 at 4:01 PM, Dave T <davestechshop@gmail.com> wrote: > >> I will be very disappointed if I cannot use btrfs + dm-crypt. As far >> as I can see, there is no alternative given that I need to use >> snapshots (and LVM, as good as it is, has severe performance penalties >> for its snapshots). > > See LVM thin provisioning snapshots. I haven't benchmarked it, but > it's a night and day difference from conventional (thick) snapshots. > The gotchas are currently there's no raid support, and the snapshots > are whole volume. So each snapshot appears as a volume with the same > UUID as the original, and by default they're not active. So for me > it's a bit of a head scratcher what happens when mounting a snapshot > concurrent with another. For Btrfs this ends badly. For XFS it refuses > unless using nouuid, but still seems capable of writing to the two > volumes without causing problems. > XFS now allows changing UUID, as do LVM and MD. We can also change btrfs UUID using "btrfstune -u", but I wonder if there is any way to change device UUID in this case. One problem is that even before you come around doing it various udev rules kick in and create links to wrong instance overwriting previous ones; and I'm not sure either xfs_admin or btrfstune trigger change event. So we may end up with stale completely wrong links. ^ permalink raw reply [flat|nested] 28+ messages in thread
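[Editor's note: the UUID changes Andrei mentions look roughly like the following. A sketch only: device paths are placeholders, both tools require the filesystem to be unmounted, and the udevadm step is my own assumption about how to refresh the /dev/disk/by-uuid links he is worried about going stale.]

```
# Give a btrfs filesystem a new random UUID (needs a recent btrfs-progs)
btrfstune -u /dev/sdX1

# Same for XFS
xfs_admin -U generate /dev/sdX2

# Nudge udev so the /dev/disk/by-uuid/* symlinks get regenerated
udevadm trigger --action=change --sysname-match=sdX1
```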
* Re: checksum error in metadata node - best way to move root fs to new drive? 2016-08-10 22:01 ` Dave T 2016-08-10 22:23 ` Chris Murphy @ 2016-08-11 4:50 ` Duncan 2016-08-11 5:06 ` Gareth Pye 1 sibling, 1 reply; 28+ messages in thread From: Duncan @ 2016-08-11 4:50 UTC (permalink / raw) To: linux-btrfs Dave T posted on Wed, 10 Aug 2016 18:01:44 -0400 as excerpted: > Does anyone have any thoughts about using dup mode for metadata on a > Samsung 950 Pro (or any NVMe drive)? The biggest problem with dup on ssds is that some ssds (particularly the ones with the sandforce controllers) do dedup, so you'd be having btrfs do dup while the filesystem dedups, to no effect except more cpu and device processing! (The other argument for single on ssd that I've seen is that because the FTL ultimately places the data, and because both copies are written at the same time, there's a good chance that the FTL will write them into the same erase block and area, and a defect in one will likely be a defect in the other as well. That may or may not be, I'm not qualified to say, but as explained below, I do choose to take my chances on that and thus do run dup on ssd.) So as long as the SSD doesn't have a deduping FTL, I'd suggest dup for metadata on ssd does make sense. Data... not so sure on, but certainly metadata, because one bad block of metadata can be many messed up files. On my ssds here, which I know don't do dedup, most of my btrfs are raid1 on the pair of ssds. However, /boot is different since I can't really point grub at two different /boots, so I have my working /boot on one device, with the backup /boot on the other, and the grub on each one pointed at its respective /boot, so I can select working or backup /boot from the BIOS and it'll just work. Since /boot is so small, it's mixed- mode chunks, meaning data and metadata are mixed together and the redundancy mode applies to both at once instead of each separately. And I chose dup, so it's dup for both data and metadata. 
Works fine, dup for both data and metadata on non-deduping ssds, but of course that means data takes double the space since there's two copies of it, and that gets kind of expensive on ssd, if it's more than the fraction of a GiB that's /boot. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 28+ messages in thread
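[Editor's note: the setups Duncan describes might be created along these lines. A sketch only; device paths and the mount point are placeholders, and dup metadata on a single device assumes a reasonably recent btrfs-progs.]

```
# New single-device filesystem with dup metadata and single data
mkfs.btrfs -m dup -d single /dev/sdX2

# Convert an existing filesystem's metadata to dup, in place
btrfs balance start -mconvert=dup /mnt/point

# Small mixed-mode /boot, where dup covers data and metadata together
mkfs.btrfs --mixed -m dup -d dup /dev/sdX1
```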
* Re: checksum error in metadata node - best way to move root fs to new drive? 2016-08-11 4:50 ` Duncan @ 2016-08-11 5:06 ` Gareth Pye 2016-08-11 8:20 ` Duncan 0 siblings, 1 reply; 28+ messages in thread From: Gareth Pye @ 2016-08-11 5:06 UTC (permalink / raw) To: Duncan; +Cc: linux-btrfs Is there some simple muddling of meta data that could be done to force dup meta data on deduping SSDs? Like a simple 'random' byte repeated often enough it would defeat any sane dedup? I know it would waste data but clearly that is considered worth it with dup metadata (what is the difference between 50% metadata efficiency and 45%?) On Thu, Aug 11, 2016 at 2:50 PM, Duncan <1i5t5.duncan@cox.net> wrote: > Dave T posted on Wed, 10 Aug 2016 18:01:44 -0400 as excerpted: > >> Does anyone have any thoughts about using dup mode for metadata on a >> Samsung 950 Pro (or any NVMe drive)? > > The biggest problem with dup on ssds is that some ssds (particularly the > ones with the sandforce controllers) do dedup, so you'd be having btrfs > do dup while the filesystem dedups, to no effect except more cpu and > device processing! > > (The other argument for single on ssd that I've seen is that because the > FTL ultimately places the data, and because both copies are written at > the same time, there's a good chance that the FTL will write them into > the same erase block and area, and a defect in one will likely be a > defect in the other as well. That may or may not be, I'm not qualified > to say, but as explained below, I do choose to take my chances on that > and thus do run dup on ssd.) > > So as long as the SSD doesn't have a deduping FTL, I'd suggest dup for > metadata on ssd does make sense. Data... not so sure on, but certainly > metadata, because one bad block of metadata can be many messed up files. > > On my ssds here, which I know don't do dedup, most of my btrfs are raid1 > on the pair of ssds. 
However, /boot is different since I can't really > point grub at two different /boots, so I have my working /boot on one > device, with the backup /boot on the other, and the grub on each one > pointed at its respective /boot, so I can select working or backup /boot > from the BIOS and it'll just work. Since /boot is so small, it's mixed- > mode chunks, meaning data and metadata are mixed together and the > redundancy mode applies to both at once instead of each separately. And > I chose dup, so it's dup for both data and metadata. > > Works fine, dup for both data and metadata on non-deduping ssds, but of > course that means data takes double the space since there's two copies of > it, and that gets kind of expensive on ssd, if it's more than the > fraction of a GiB that's /boot. > > -- > Duncan - List replies preferred. No HTML msgs. > "Every nonfree program has a lord, a master -- > and if you use the program, he is your master." Richard Stallman > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Gareth Pye - blog.cerberos.id.au Level 2 MTG Judge, Melbourne, Australia ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: checksum error in metadata node - best way to move root fs to new drive? 2016-08-11 5:06 ` Gareth Pye @ 2016-08-11 8:20 ` Duncan 0 siblings, 0 replies; 28+ messages in thread From: Duncan @ 2016-08-11 8:20 UTC (permalink / raw) To: linux-btrfs Gareth Pye posted on Thu, 11 Aug 2016 15:06:48 +1000 as excerpted: > Is there some simple muddling of meta data that could be done to force > dup meta data on deduping SSDs? Like a simple 'random' byte repeated > often enough it would defeat any sane dedup? I know it would waste data > but clearly that is considered worth it with dup metadata (what is the > difference between 50% metadata efficiency and 45%?) Well, the FTLs are mostly proprietary, AFAIK, so it's probably hard to prove the "force", but given the 512-byte sector standard (some are a multiple of that these days but 512 should be the minimum), in theory one random byte out of every 512 should do it... unless the compression these deduping FTLs generally run as well catches that difference and compresses it out to a different location where it can be compactly stored, allowing multiple copies of the same 512-byte sector to be stored in a single sector, so long as they only had a single byte or two different. So it could probably be done, but given that the deduping and compression features of these ssds are listed as just that, features, and that people buy them for that, it may be that it's better to simply leave well enough alone. Folks who want dup metadata can set it, and if they haven't bought one of these ssds with dedup as a feature, they can be reasonably sure it'll be set. And people who don't care will simply get the defaults and can live with them the same way that people that don't care generally live with defaults that may or may not be the absolute best case for them, but are generally at least not horrible. -- Duncan - List replies preferred. No HTML msgs. 
"Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: checksum error in metadata node - best way to move root fs to new drive? 2016-08-10 21:21 ` Chris Murphy 2016-08-10 22:01 ` Dave T @ 2016-08-12 17:00 ` Patrik Lundquist 1 sibling, 0 replies; 28+ messages in thread From: Patrik Lundquist @ 2016-08-12 17:00 UTC (permalink / raw) To: Btrfs BTRFS On 10 August 2016 at 23:21, Chris Murphy <lists@colorremedies.com> wrote: > > I'm using LUKS, aes xts-plain64, on six devices. One is using mixed-bg > single device. One is dsingle mdup. And then 2x2 mraid1 draid1. I've > had zero problems. The two computers these run on do have aesni > support. Aging wise, they're all at least a year old. But I've been > using Btrfs on LUKS for much longer than that. FWIW: I've had 5 spinning disks with LUKS + Btrfs raid1 for 1,5 years. Also xts-plain64 with AES-NI acceleration. No problems so far. Not using Btrfs compression. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: checksum error in metadata node - best way to move root fs to new drive? 2016-08-10 3:27 Dave T 2016-08-10 6:27 ` Duncan @ 2016-08-10 21:15 ` Chris Murphy 2016-08-10 22:50 ` Dave T 1 sibling, 1 reply; 28+ messages in thread From: Chris Murphy @ 2016-08-10 21:15 UTC (permalink / raw) To: Dave T; +Cc: Btrfs BTRFS On Tue, Aug 9, 2016 at 9:27 PM, Dave T <davestechshop@gmail.com> wrote: > btrfs scrub returned with uncorrectable errors. Searching in dmesg > returns the following information: > > BTRFS warning (device dm-0): checksum error at logical NNNNN on > /dev/mapper/[crypto] sector: yyyyy metadata node (level 2) in tree 250 > > it also says: > > unable to fixup (regular) error at logical NNNNNN on /dev/mapper/[crypto] > > > I assume I have a bad block device. Does that seem correct? The > important data is backed up. If it were persistently, blatantly bad, then the drive firmware would know about it, and would report a read error. If you're not seeing libata UNC errors, or the other way it manifests is with hard link resets due to inappropriate SCSI command timer default in the kernel, then it's probably some kind of SDC, torn or misdirected write, etc. If metadata is profile DUP, then scrub should fix it. If it's not, there's something else going on (or really bad luck). I'd like to believe that btrfs check can, or someday will, be able to do some kind of sanity check on a node that fails checksum, and fix it. That the node can be read but merely fails checksum isn't a really good reason for a file system to not give you access to its data, but yeah it kinda depends on what's in the node. It could contain up to a couple hundred items, each of which points elsewhere. btrfs-debug-tree -b <block number reported by error at logical> <dev> might give some hint what's going on. I'd like to believe it'll be noisy and warn the checksum fails but still show the contents assuming the drive hands over the data on those sectors. 
> If I can copy this entire root filesystem, what is the best way to do > it? The btrfs restore tool? cp? rsync? Some cloning tool? Other > options? 0. Backup, that's done. 1. Report 'btrfs check' without --repair, let's see what it complains about and if it might be able to plausibly fix this. Since you can scrub, it means the file system mounts. Since the file system mounts, I would not look at restore to start out because it's tedious. I'd say you toss a coin over using btrfs send/receive, or btrfs check --repair to see if it fixes the node. These days it should be safe with relatively recent btrfs-progs so I'd say use a 4.6.x or 4.7 progs for this. And then the send/receive should be done with -v or maybe even -vv for both send and receive, along with --max-errors 0, which will permit unlimited errors but will report them rather than failing midstream. This will get you the bulk of the OS. If you're lucky, the node contains only a handful of relatively unimportant items, especially if they're files small enough to be stored inline the node, which will substantially reduce the number of errors as a result of a single node loss. The calculus on btrfs check --repair first then send receive, vs send/receive then if that fails fallback to btrfs check --repair, is mainly time. Maybe repair can fix it, maybe it makes things worse. Where send/receive might fail midstream without the node being fixed first, but it causes no additional problems. The 2nd is more conservative but takes more time if it turns out the send/receive fails, you then do repair, and then have to start the send/receive over from scratch again. (If it fails, you should delete or rename the bad subvolume on the receive side before starting another send). > If I use the btrfs restore tool, should I use options x, m and S? In > particular I wonder exactly what the S option does. If I leave S out, > are all symlinks ignored? 
I would only use restore for the files that are reported by send/receive as failed due to errors - assuming that even happens. Or since this is OS stuff, just reinstall the packages for the files affected by the bad node. -- Chris Murphy ^ permalink raw reply [flat|nested] 28+ messages in thread
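[Editor's note: put together, the path Chris suggests might look like the following. A sketch only; the subvolume and mount-point names are invented for illustration, and /dev/mapper/cryptroot matches the device named later in the thread.]

```
# 1. See what check reports, without modifying anything
btrfs check /dev/mapper/cryptroot

# 2. Send a read-only snapshot to the new drive; --max-errors 0
#    logs errors but keeps going instead of failing midstream
btrfs subvolume snapshot -r / /root-migrate
btrfs send -v /root-migrate | btrfs receive -vv --max-errors 0 /mnt/newroot

# 3. Only for files send/receive flagged as failed, fall back to restore
btrfs restore -v /dev/mapper/cryptroot /mnt/rescue
```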
* Re: checksum error in metadata node - best way to move root fs to new drive? 2016-08-10 21:15 ` Chris Murphy @ 2016-08-10 22:50 ` Dave T 0 siblings, 0 replies; 28+ messages in thread From: Dave T @ 2016-08-10 22:50 UTC (permalink / raw) To: Chris Murphy; +Cc: Btrfs BTRFS see below On Wed, Aug 10, 2016 at 5:15 PM, Chris Murphy <lists@colorremedies.com> wrote: > 1. Report 'btrfs check' without --repair, let's see what it complains > about and if it might be able to plausibly fix this. First, a small part of the dmesg output:

[  172.772283] Btrfs loaded
[  172.772632] BTRFS: device label top_level devid 1 transid 103495 /dev/dm-0
[  274.320762] BTRFS info (device dm-0): use lzo compression
[  274.320764] BTRFS info (device dm-0): disk space caching is enabled
[  274.320764] BTRFS: has skinny extents
[  274.322555] BTRFS info (device dm-0): bdev /dev/mapper/sysluks errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
[  274.329965] BTRFS: detected SSD devices, enabling SSD mode

Now, full output of btrfs check without repair option. 
checking extents
bad metadata [292414541824, 292414558208) crossing stripe boundary
bad metadata [292414607360, 292414623744) crossing stripe boundary
bad metadata [292414672896, 292414689280) crossing stripe boundary
bad metadata [292414738432, 292414754816) crossing stripe boundary
bad metadata [292415787008, 292415803392) crossing stripe boundary
bad metadata [292415918080, 292415934464) crossing stripe boundary
bad metadata [292416376832, 292416393216) crossing stripe boundary
bad metadata [292418015232, 292418031616) crossing stripe boundary
bad metadata [292419325952, 292419342336) crossing stripe boundary
bad metadata [292419588096, 292419604480) crossing stripe boundary
bad metadata [292419915776, 292419932160) crossing stripe boundary
bad metadata [292422930432, 292422946816) crossing stripe boundary
bad metadata [292423061504, 292423077888) crossing stripe boundary
ref mismatch on [292423155712 16384] extent item 1, found 0
Backref 292423155712 root 258 not referenced back 0x2280a20
Incorrect global backref count on 292423155712 found 1 wanted 0
backpointer mismatch on [292423155712 16384]
owner ref check failed [292423155712 16384]
bad metadata [292423192576, 292423208960) crossing stripe boundary
bad metadata [292423323648, 292423340032) crossing stripe boundary
bad metadata [292429549568, 292429565952) crossing stripe boundary
bad metadata [292439904256, 292439920640) crossing stripe boundary
bad metadata [292440297472, 292440313856) crossing stripe boundary
bad metadata [292442525696, 292442542080) crossing stripe boundary
bad metadata [292443770880, 292443787264) crossing stripe boundary
bad metadata [292443967488, 292443983872) crossing stripe boundary
bad metadata [292444033024, 292444049408) crossing stripe boundary
bad metadata [292444098560, 292444114944) crossing stripe boundary
bad metadata [292444164096, 292444180480) crossing stripe boundary
bad metadata [292444229632, 292444246016) crossing stripe boundary
bad metadata [292444688384, 292444704768) crossing stripe boundary
bad metadata [292444884992, 292444901376) crossing stripe boundary
bad metadata [292445081600, 292445097984) crossing stripe boundary
bad metadata [292446720000, 292446736384) crossing stripe boundary
bad metadata [292448948224, 292448964608) crossing stripe boundary
Error: could not find btree root extent for root 258
Checking filesystem on /dev/mapper/cryptroot
UUID:
^ permalink raw reply [flat|nested] 28+ messages in thread
end of thread, other threads:[~2016-08-15 11:33 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-11 20:23 checksum error in metadata node - best way to move root fs to new drive? Dave T
2016-08-12 4:13 ` Duncan
2016-08-12 8:14 ` Adam Borowski
2016-08-12 12:04 ` Austin S. Hemmelgarn
2016-08-12 15:06 ` Duncan
2016-08-15 11:33 ` Austin S. Hemmelgarn
2016-08-12 17:02 ` Chris Murphy
-- strict thread matches above, loose matches on Subject: below --
2016-08-10 3:27 Dave T
2016-08-10 6:27 ` Duncan
2016-08-10 19:46 ` Austin S. Hemmelgarn
2016-08-10 21:21 ` Chris Murphy
2016-08-10 22:01 ` Dave T
2016-08-10 22:23 ` Chris Murphy
2016-08-10 22:52 ` Dave T
2016-08-11 14:12 ` Nicholas D Steeves
2016-08-11 14:45 ` Austin S. Hemmelgarn
2016-08-11 19:07 ` Duncan
2016-08-11 20:43 ` Chris Murphy
2016-08-12 3:11 ` Duncan
2016-08-12 3:51 ` Chris Murphy
2016-08-11 20:33 ` Chris Murphy
2016-08-11 7:18 ` Andrei Borzenkov
2016-08-11 4:50 ` Duncan
2016-08-11 5:06 ` Gareth Pye
2016-08-11 8:20 ` Duncan
2016-08-12 17:00 ` Patrik Lundquist
2016-08-10 21:15 ` Chris Murphy
2016-08-10 22:50 ` Dave T