Date: Thu, 7 Apr 2022 08:27:54 -0400
From: Zygo Blaxell
To: Josef Bacik
Cc: Marc MERLIN, "linux-btrfs@vger.kernel.org"
Subject: Re: figuring out why transient double raid failure caused a fair amount of btrfs corruption

On Wed, Apr 06, 2022 at 04:51:14PM -0400, Josef Bacik wrote:
> On Wed, Apr 6, 2022 at 4:38 PM Marc MERLIN wrote:
> >
> > This is an interesting discussion, so let's make a new thread out of
> > it.  TL;DR: I think btrfs may have failed to go read-only earlier,
> > causing more damage than it needed to, or some block layers just held
> > enough data in flight that partial data got written, causing more
> > damage than expected.  Figuring out the underlying problem would be
> > good to avoid this again.
>
> We can't do anything about the disks lying to us.  If a disk has a
> wonky FUA/FLUSH implementation then we're just sort of screwed.
> Unfortunately, because our metadata moves around a lot, we're waaaaay
> more susceptible to this failure case than ext4 or xfs; their metadata
> is relatively static, so they can put humpty dumpty back together again
> relatively simply.

Disks are pretty good about FUA/FLUSH these days, though you can still
buy new old drives that get it wrong.  SSDs have a whole lot of bugs,
especially when they get old, have a few bits fail, and try to recover
themselves, but blow up the whole filesystem by losing a few pages.
These get through firmware qualification testing because they only
misbehave when they're a year or two old.

> Btrfs needs to
>
> 1. Go whole hog on error injection testing.  I only barely scratched
> the surface with my bpf error injection stuff.  This is on our roadmap
> and I plan on devoting developer time to it, but clearly that doesn't
> help you right now.
>
> 2. Put a lot more effort into disaster recovery.  What I've written
> for you is an idea I've had in my head for a while.  Some of these
> failures aren't catastrophic; we can generally put back together a
> filesystem that resembles something sane pretty easily by stitching
> together blocks that we find that are close enough to what we wanted.
> Unfortunately this gets back-burnered because in reality it doesn't
> happen that often.
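On point 1: even before more bpf work lands, the in-tree fault-injection
framework can already exercise some error paths from userspace.  The
sketch below is adapted from the fail_function example in
Documentation/fault-injection/fault-injection.rst; it needs
CONFIG_FUNCTION_ERROR_INJECTION, and the choice of open_ctree with
-ENOMEM is only illustrative of the technique, not a claim about what
your test plan does:

	#!/bin/bash
	# build a throwaway filesystem on a loop device
	dd if=/dev/zero of=testfile.img bs=1M seek=1000 count=1
	DEV=$(losetup --show -f testfile.img)
	mkfs.btrfs -f "$DEV"
	mkdir -p tmpmnt

	# arm fail_function: make open_ctree return -ENOMEM on every call
	F=/sys/kernel/debug/fail_function
	echo open_ctree > $F/inject
	printf %#x -12 > $F/open_ctree/retval   # -ENOMEM
	echo 100 > $F/probability
	echo -1 > $F/times
	echo 1 > $F/verbose

	# the mount should fail cleanly instead of oopsing or leaving
	# the filesystem in a weird state
	mount -t btrfs "$DEV" tmpmnt

	# disarm the injection when done
	echo > $F/inject

The point of loops like this is less the specific function and more
that every injected failure should end in a clean error or a read-only
filesystem, never in metadata that can't be read back later.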
Better disaster-recovery tooling does seem to be an important missing
piece at the moment.  In practice, when we see "parent transid verify
failed," we go directly to mkfs + restore from backups, because none of
the existing tools will touch it.  They all want intact interior nodes
in some tree to work, and that's the one thing that FUA/FLUSH bugs
destroy.  It won't solve the underlying problem--the drive will still
trash the filesystem every other month until its write ordering gets
fixed or worked around--but at least it would let users fix the
filesystem in place without mkfs + restoring a full set of backups.

> 3. Test these btrfs+dmcrypt+mdraid setups.  Every time I notice one of
> these catastrophic failures it generally involves btrfs+<something
> else>.  This is likely just a timing thing: the more layers you put in,
> the wider the window for per-io races, and the more likely you are to
> be sad in the event of a failure.  However, it would be good to make
> sure these layers are doing the correct thing themselves.

bcache and dmcache layers in series with the underlying storage
multiply the failure rates (including software regressions) of the
individual components, and mdadm blends firmware bugs from all the
drives together.  Statistically we're always going to see more problems
if there are more moving parts in the system, especially if they
aggregate risks and break fault isolation.

FWIW I'm not seeing a difference in failure rate between "btrfs + mdadm"
and "btrfs alone", but I'm intentionally avoiding the many mdadm
configurations that introduce new failure modes by design (e.g. raid5
without ppl or journal, raid1 without component-device integrity, or a
single/raid1 SSD cache in writeback mode running over multiple drives).
There isn't a fix for those except not to use them.  In most cases
there's a better way to do the same thing, though there are some gaps
(e.g. there's no working solution for writeback SSD caching on raid5).

Thought experiment: if you have 2 drives from vendor A and 2 drives
from vendor B, and you want to build two filesystems that replicate to
each other, where do you put the drives if you know one vendor has a
firmware bug but you don't know which one?

With mdadm, you build the filesystems with AA and BB drives, because
that way a firmware bug corrupts one filesystem and leaves the other
intact.  With btrfs, you build the two filesystems with AB and AB
drives, because that way a firmware bug gets autocorrected by btrfs and
both filesystems remain intact.  If you put btrfs on mdadm on AB drives,
you combine the risk from both vendors, lose both filesystems, and skew
the failure statistics.

> We need to be better about this scenario, both in making sure we don't
> have bugs that contribute to the problem, but also that we have the
> tools necessary to recover when things go wrong.  Thanks,
>
> Josef
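On the tools side, what exists today doesn't get very far once the
interior nodes are gone.  For comparison, a rough sketch of the options
currently available (/dev/sdX, the mount points, and <bytenr> are
placeholders; whether any step works depends entirely on which nodes
the drive threw away, and none of it repairs the filesystem in place):

	# mount read-only using one of the superblock's backup roots
	# (older kernels spell this -o ro,usebackuproot)
	mount -o ro,rescue=usebackuproot /dev/sdX /mnt

	# if that fails, list older tree roots still present on disk
	btrfs-find-root /dev/sdX

	# dry-run restore against a candidate root from the output above,
	# then copy the data out to fresh storage (-x xattrs, -m metadata,
	# -i ignore errors); <bytenr> is a placeholder
	mkdir -p /mnt/recovery
	btrfs restore -t <bytenr> -D /dev/sdX /mnt/recovery
	btrfs restore -t <bytenr> -x -m -i /dev/sdX /mnt/recovery

Everything after btrfs-find-root is really just a smarter way to
evacuate data before mkfs, which is why the stitch-blocks-back-together
idea above would be such a big improvement.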