From: Nathan Dehnel
Date: Sun, 20 Jun 2021 21:48:51 +0000
Subject: Re: Recover from "couldn't read tree root"?
To: Chris Murphy
Cc: Btrfs BTRFS
X-Mailing-List: linux-btrfs@vger.kernel.org

>I suggest searching logs since the last time this file system was
>working, because the above error is indicating a problem that's
>already happened and what we need to know is what happened, if
>possible.
>Something like this:
>journalctl --since=-5d -k -o short-monotonic --no-hostname | grep
>"Linux version\| ata\|bcache\|Btrfs\|BTRFS\|] hd\| scsi\| sd\| sdhci\|
>mmc\| nvme\| usb\| vd"
Unfortunately I put my journal logs in a different subvolume so they
wouldn't bloat my snapshots, which means they weren't included in my
backups.

>So I'm gonna guess a single shared SSD, which is a single point of
>failure, and it's spitting out garbage or zeros.
It's 2 SSDs in mdraid RAID10.

>But I'm not even close to a bcache expert so you might want to ask
>bcache developers how to figure out what state bcache is in and
>whether and how to safely decouple it from the backing drives so that
>you can engage in recovery attempts.
They didn't respond the last couple of times I asked a question on
their IRC channel or mailing list.
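
In case it helps, this is roughly what I was planning to run next to
see what state bcache thinks it's in and whether the caches are
writethrough or writeback, plus a read-only dump of every btrfs
superblock copy. Untested on the broken machine, and it assumes the
backing devices still register as bcache0-3:

# bcache state and cache mode per backing device (read-only, via sysfs)
for d in /sys/block/bcache[0-3]/bcache; do
    echo "== $d"
    cat "$d/state"        # e.g. clean / dirty / inconsistent / no cache
    cat "$d/cache_mode"   # the mode shown in [brackets] is the active one
done

# print all superblock copies on one member without writing anything
btrfs inspect-internal dump-super -f -a /dev/bcache0 | \
    grep -E 'superblock|generation|bytenr'
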
On Sun, Jun 20, 2021 at 9:19 PM Chris Murphy wrote:
>
> On Sun, Jun 20, 2021 at 2:38 PM Nathan Dehnel wrote:
> >
> > A machine failed to boot, so I tried to mount its root partition from
> > systemrescuecd, which failed:
> >
> > [ 5404.240019] BTRFS info (device bcache3): disk space caching is enabled
> > [ 5404.240022] BTRFS info (device bcache3): has skinny extents
> > [ 5404.243195] BTRFS error (device bcache3): parent transid verify
> > failed on 3004631449600 wanted 1420882 found 1420435
> > [ 5404.243279] BTRFS error (device bcache3): parent transid verify
> > failed on 3004631449600 wanted 1420882 found 1420435
> > [ 5404.243362] BTRFS error (device bcache3): parent transid verify
> > failed on 3004631449600 wanted 1420882 found 1420435
> > [ 5404.243432] BTRFS error (device bcache3): parent transid verify
> > failed on 3004631449600 wanted 1420882 found 1420435
> > [ 5404.243435] BTRFS warning (device bcache3): couldn't read tree root
> > [ 5404.244114] BTRFS error (device bcache3): open_ctree failed
>
> This is generally bad, and means some lower layer did something wrong,
> such as getting write order incorrect, i.e. failing to properly honor
> flush/fua. Recovery can be difficult and take a while.
> https://btrfs.wiki.kernel.org/index.php/Problem_FAQ#parent_transid_verify_failed
>
> I suggest searching logs since the last time this file system was
> working, because the above error is indicating a problem that's
> already happened and what we need to know is what happened, if
> possible. Something like this:
>
> journalctl --since=-5d -k -o short-monotonic --no-hostname | grep
> "Linux version\| ata\|bcache\|Btrfs\|BTRFS\|] hd\| scsi\| sd\| sdhci\|
> mmc\| nvme\| usb\| vd"
>
> >
> > btrfs rescue super-recover -v /dev/bcache0 returned this:
> >
> > parent transid verify failed on 3004631449600 wanted 1420882 found 1420435
> > parent transid verify failed on 3004631449600 wanted 1420882 found 1420435
> > parent transid verify failed on 3004631449600 wanted 1420882 found 1420435
> > parent transid verify failed on 3004631449600 wanted 1420882 found 1420435
> > parent transid verify failed on 3004631449600 wanted 1420882 found 1420435
> > Ignoring transid failure
> > ERROR: could not setup extent tree
> > Failed to recover bad superblocks
>
> OK something is really wrong if you're not able to see a single
> superblock on any of the bcache devices. Every member device has 3
> super blocks, given the sizes you've provided. For there to not be a
> single one is a spectacular failure, as if the bcache cache device
> isn't returning correct information for any of them. So I'm gonna
> guess a single shared SSD, which is a single point of failure, and
> it's spitting out garbage or zeros. But I'm not even close to a bcache
> expert so you might want to ask bcache developers how to figure out
> what state bcache is in and whether and how to safely decouple it from
> the backing drives so that you can engage in recovery attempts.
>
> If bcache mode is write through, there's a chance the backing drives
> have valid btrfs metadata, and it's just that on read the SSD is
> returning bogus information.
>
> --
> Chris Murphy
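
For the archive: assuming I can get the cache safely out of the picture
first (per the warning above, I'll check with the bcache folks before
touching anything), these are the read-only recovery steps I was going
to try next, roughly following the wiki page linked above. None of them
should write to the devices, and the rescue= syntax needs a newer
kernel (older kernels spell it -o ro,usebackuproot):

mount -o ro,rescue=usebackuproot /dev/bcache0 /mnt   # try the backup tree roots, read-only
btrfs-find-root /dev/bcache0                         # list candidate tree roots (generation/bytenr)
btrfs restore -D -t <bytenr> /dev/bcache0 /tmp/out   # dry run: see what a given root could recover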