Date: Wed, 11 May 2022 22:39:00 -0400
From: Zygo Blaxell
To: kreijack@inwind.it
Cc: Josef Bacik, Marc MERLIN, linux-btrfs
Subject: Re: Rebuilding 24TB Raid5 array (was btrfs corruption: parent transid verify failed + open_ctree failed)
X-Mailing-List: linux-btrfs@vger.kernel.org

On Wed, May 11, 2022 at 08:00:07PM +0200, Goffredo Baroncelli wrote:
> On 11/05/2022 18.05, Josef Bacik wrote:
> > On Wed, May 11, 2022 at 12:00 PM Marc MERLIN wrote:
> > >
> > > On Wed, May 11, 2022 at 11:21:37AM -0400, Josef Bacik wrote:
> [...]
>
> Hi Joseph, Marc,
>
> I looked back in the thread but I was unable to find if it was discussed.
> Even if the size of the FS is quite big, I am wondering if does make sense
> that Marc send to Josef the metadata of the filesystem, to speed up the
> recover.
> I am sure that btrfs-image was considered but then discarded (due to a risk
> of leaking of sensible information ? or may be the image would be too big ?);

The metadata would be 15GB for the csums alone (according to a
btrfs-dump-super in the thread, only 15TB of the 24TB total space is
used; assuming the default 4-byte crc32c per 4KiB data block, the csums
come to about 1/1000 of the data size). Subvol metadata would be on top
of that. Too big for an email, but not impossible to transfer
bidirectionally with a tool such as rsync.

Alternatively, Josef could debug the tool on a copy of the metadata,
then send the tool to Marc to run again on the original metadata, so a
unidirectional transfer would suffice. Handily, this also provides Marc
with a restorable backup copy of the metadata, so if there are bugs in
the recovery tool, the original metadata can be restored to try again
with a fixed tool.

Since the metadata tree is damaged, it can't be traversed (fixing that
is the whole point of this exercise), so btrfs-image won't work. The
image would have to be a raw dump of all the metadata block groups to
get all the potentially relevant pages (including orphan leaf pages
whose parents have been lost due to the disk failures). In most cases
capturing the free space in metadata block groups would at most double
the size, but 30GB is still manageable.

This is a relatively simple tool to build for the case where the chunk
tree is intact, which it seems to be in this case.
If it weren't, a brute-force tool could scan the entire disk and pick up
anything that looks like a metadata page, then write a chunk tree that
matches the majority of collected pages (given a metadata page header's
bytenr and the on-disk position, we can identify which metadata chunk
it belongs to; the first and last pages within that chunk provide the
logical and physical boundaries of the chunk).

Dropping the csum pages from the image is possible. They all have
distinct item keys not found in any other metadata page, so they're
easy to spot on disk and leave out of the transfer; however, the more
pages that are left out of the image, the more pages that are lost or
awkwardly unverifiable, increasing the risk that the recovery tool won't
be able to fix the filesystem. It's likely that at least the interior
nodes of the csum tree will still need to be part of the recovery
operation, if not all the leaf nodes as well.

The alternative is to blow up the csum tree and generate a new one from
subvol data, but that means no data integrity checking on any of the
data blocks (which are likely also corrupted by the disk failures).
Undetected corruption becomes inevitable.

Assuming no snapshots and the distribution of file sizes and types
mentioned in the thread (100MB..10GB files, no inline), it would be
maybe 10-15 MB for one subvol tree; double that to include the extent
tree for eliminating duplicate or obsolete pages. That _does_ fit in
an email, especially after compression (subvol trees are highly
compressible: half the bits are zero and a third of the rest are
consecutive or constant numbers).

> It would be interested to know the reasons; may be that even if btrfs-image
> doesn't fit this particular case in the current form, it can be extended
> to handle a case like the Marc one..
>
> BR
> --
> gpg @keyserver.linux.it: Goffredo Baroncelli
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
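
For illustration only, the brute-force scan described above (find pages
by header, then infer chunk boundaries from bytenr versus on-disk
position) could be sketched roughly as follows. This is not an existing
tool. The names find_pages and guess_chunks are hypothetical, the
nodesize and scan step are assumed defaults, and the sketch assumes a
linear logical-to-physical mapping within a chunk (as with single, DUP
or RAID1 profiles), which would not hold for striped profiles. It also
skips header checksum verification, so a real tool would have to be
stricter about what it accepts as a metadata page.

```python
import struct
from collections import defaultdict

SECTORSIZE = 4096   # assumed scan step; tree blocks are at least sector-aligned
NODESIZE = 16384    # assumed mkfs default; the real value is in the superblock
FSID_OFF = 32       # btrfs_header layout: csum[32], fsid[16], bytenr (le64), ...
BYTENR_OFF = 48
HEADER_LEN = 101    # size of the on-disk btrfs_header


def find_pages(buf, fsid, base=0, step=SECTORSIZE):
    """Return (physical, logical) pairs for blocks whose header carries fsid.

    Only the fsid is checked here (hypothetical shortcut); the csum is not
    verified. Superblock copies start with the same csum/fsid/bytenr prefix,
    so they would also match and are not filtered out.
    """
    pages = []
    for off in range(0, len(buf) - HEADER_LEN + 1, step):
        if buf[off + FSID_OFF:off + FSID_OFF + 16] != fsid:
            continue
        (bytenr,) = struct.unpack_from("<Q", buf, off + BYTENR_OFF)
        pages.append((base + off, bytenr))
    return pages


def guess_chunks(pages, nodesize=NODESIZE, min_pages=2):
    """Group pages by (physical - logical) and report one extent per group.

    Pages of one linearly mapped chunk share the same delta. Groups with
    fewer than min_pages members are treated as stale or stray pages and
    dropped, a crude stand-in for "matches the majority of pages".
    Two distinct chunks that happen to share a delta would be merged.
    """
    by_delta = defaultdict(list)
    for physical, logical in pages:
        by_delta[physical - logical].append(logical)

    chunks = []
    for delta, logicals in by_delta.items():
        if len(logicals) < min_pages:
            continue
        first, last = min(logicals), max(logicals)
        chunks.append({
            "logical": first,                   # lowest page seen, not the true chunk start
            "physical": first + delta,
            "length": last + nodesize - first,  # spans first..last page inclusive
            "pages": len(logicals),
        })
    return sorted(chunks, key=lambda c: c["logical"])
```

The extents reported this way are only lower bounds on the real chunk
size, since the first and last pages found need not sit at the chunk's
edges; they would have to be rounded out to plausible chunk boundaries
before writing a replacement chunk tree.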