From: Chris Murphy
Date: Mon, 18 Feb 2019 14:06:36 -0700
Subject: Re: Corrupted filesystem, looking for guidance
To: Sébastien Luttringer
Cc: Chris Murphy, linux-btrfs

On Mon, Feb 18, 2019 at 1:14 PM Sébastien Luttringer wrote:
>
> On Tue, 2019-02-12 at 15:40 -0700, Chris Murphy wrote:
> > On Mon, Feb 11, 2019 at 8:16 PM Sébastien Luttringer wrote:
> >
> > FYI: This only does full stripe reads, recomputes parity and overwrites
> > the parity strip. It assumes the data strips are correct, so long as the
> > underlying member devices do not return a read error. And the only way
> > they can return a read error is if their SCT ERC time is less than the
> > kernel's SCSI command timer. Otherwise errors can accumulate.
> >
> > smartctl -l scterc /dev/sdX
> > cat /sys/block/sdX/device/timeout
> >
> > The first must be a lesser value than the second. If the first is
> > disabled and can't be enabled, then the generally accepted assumed
> > maximum time for recoveries is an almost unbelievable 180 seconds; so
> > the second needs to be set to 180 and is not persistent. You'll need a
> > udev rule or startup script to set it at every boot.
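To make that last part concrete, the startup script variant is just a loop
over sysfs (a sketch, untested here; adjust the glob if you have devices
that shouldn't be touched):

  for f in /sys/block/sd*/device/timeout; do echo 180 > "$f"; done

and the udev variant is something along the lines of the following, dropped
into a file such as /etc/udev/rules.d/60-scsi-timeout.rules (the file name
is arbitrary):

  ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTR{device/timeout}="180"

Either way the value has to be reapplied at every boot, as noted above.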
> All my disks' firmware doesn't allow ERC to be modified through SCT.
>
> # smartctl -l scterc /dev/sda
> smartctl 7.0 2018-12-30 r4883 [x86_64-linux-4.19.20-seblu] (local build)
> Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
>
> SCT Error Recovery Control command not supported
>
> I was not aware of that timer. I needed time to read and experiment on
> this. Sorry for the long response time. I hope you didn't timeout. :)
>
> After simulating several errors and timeouts with scsi_debug[1],
> fault_injection[2], and dmsetup[3], I don't understand why you suggest
> this could lead to corruption. When a SCSI command times out, the
> mid-layer[4] does several error recovery attempts. These attempts are
> logged into the kernel ring buffer and at worst the device is put offline.

No. At worst, if the SCSI command timer is reached before the drive's SCT
ERC timeout, the kernel assumes the device is not responding and does a
link reset. That link reset obliterates the entire command queue on SATA
drives. And that means it's no longer possible to determine which sector is
having a problem, and therefore not possible to fix it by overwriting that
sector with good data. This is a problem for Btrfs raid, as well as md and
LVM.

> From my experiment, the md layer has no timeout, and waits as long as the
> underlying layer doesn't return, either during a check or a normal
> read/write attempt.
>
> I understand the benefits of keeping the disk's time to recover from
> errors below the HBA timeout. It prevents the disk from being kicked out
> of the array.

The md driver tolerates a fixed number or rate (I'm not sure which) of read
errors before a drive is marked faulty. The md driver I think tolerates
only one write failure, and then the drive is marked faulty. So far there
is no faulty concept in Btrfs; there are patches upstream for this, but I
don't know their merge status.

> However, I don't see how this could lead to a difference between check and
> repair in the md layer, or even trigger some corruption between the chunks
> inside a stripe.

It allows bad sectors to accumulate, because they never get repaired. The
only way they can be repaired is if the drive itself gives up on a sector
and reports a discrete uncorrected read error along with the sector LBA.
That's the only way the md driver knows which md chunk is affected, and
where to get a good copy, read it, and then overwrite the bad copy on the
device with a read error. The linux-raid@ list is full of examples of this.
And it does sometimes lead to the loss of the array, in particular in the
case of parity arrays, where such read errors tend to be colocated. A read
error in a stripe is functionally identical to a single device loss for
that stripe. So if the bad sector isn't repaired, only one more error is
needed and you get a full stripe loss, and it's not recoverable. If the
lost stripe is (user) data only, then you just lose a file. But if the lost
stripe contains file system metadata, it can mean the loss of the file
system on that md array.
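For what it's worth, watching for that accumulation is cheap and doesn't
write anything to the array (md127 and sda below are just the device names
from your setup; substitute as needed): trigger a check scrub, look at the
mismatch count when it finishes, and keep an eye on the drives'
pending/reallocated/uncorrectable SMART counters:

  # echo check > /sys/block/md127/md/sync_action
  # cat /sys/block/md127/md/mismatch_cnt
  # smartctl -A /dev/sda | grep -iE 'pending|realloc|uncorrect'

A check scrub only reads; it's the repair scrub that writes.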
> After reading the whole md(5) manual, I realize how bad it is to rely on
> the md layer to guarantee data integrity. There is no mechanism to know
> which chunk is corrupted in a stripe.

Correct. There is a tool that's part of mdadm that will do this if it's a
raid6 array.

> I'm wondering if using btrfs raid5, despite its known flaws, is not safer
> than md.

I can't point to a study that'd give us the various probabilities to answer
this question. In the meantime, I'd say all raid5 is fraught with peril the
instant there's any unhandled corruption or read error. And it's a very
common misconfiguration to have consumer SATA drives that lack configurable
SCT ERC, so the drive takes longer to produce an error than the SCSI
command timer allows before it triggers a link reset.

> > Further, if the mismatches are consistently in the same sector range, it
> > suggests the repair scrub returned one set of data, and the subsequent
> > check scrub returned different data - that's the only way you get
> > mismatches following a repair scrub.
>
> It was the same range. That was my understanding too.
>
> I finally got rid of these errors by removing a disk, wiping the
> superblock and adding it back to the raid. Since then, no check error
> (tested twice).

*shrug* I'm not super familiar with all the mdadm features. It's vaguely
possible your md array is using the bad block mapping feature, and perhaps
that's related to this behavior. Something in my memory is telling me that
this isn't really the best feature to have enabled in every use case; it's
really strictly for continuing to use drives that have all reserve sectors
used up, which means bad sectors result in write failures. The bad block
mapping allows md to do its own remapping, so there won't be write failures
in such a case.

Anyway, raids are complicated, and they are something of a Rube Goldberg
contraption. If you don't understand all the possible outcomes, and aren't
prepared for failures, it can lead to panic. And I've read a lot of
panic-induced data loss on linux-raid. Really common: people do a Google
search first and get bad advice, like recreating an array, and then they
wonder why their array is wiped... *shrug* My advice is, don't be in a
hurry to fix things when they go wrong. Collect information. Do things that
don't write changes anywhere. Post all the information to the proper
mailing list, working from the bottom (start) of the storage stack to the
top (the file system), and trust their advice.

> > If it's bad RAM, then chances are both copies of metadata will be
> > identically wrong and thus no help in recovery.
>
> RAM is not ECC. I tested the RAM recently and no error was found.

You might check the archives for various memory testing strategies. A
simple hour-long test often won't find the most pernicious memory errors.
At least do it over a weekend. A quick search for "austin hemmelgarn memory
test compile" turned up this thread:

  Re: btrfs ate my data in just two days, after a fresh install. ram and
  disk are ok. it still mounts, but I cannot repair
  Wed, May 4, 2016, 10:12 PM
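If you want a test beyond the usual memtest pass, a userspace tester run
under the live system is one option; for example (the size and loop count
here are arbitrary, and you need to leave room for the kernel and
everything else that's running):

  # memtester 8G 10

The approach in that thread is similar in spirit: loop something big and
parallel like a kernel compile for a day or two and watch for random
segfaults or build failures, which exercises RAM under a more realistic
load than an idle-system test.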
> But I needed more RAM to rsync all the data w/ hardlinks, so I added a
> swap file on my system disk (an SSD). The filesystem on it is also btrfs,
> so I used a loop device to work around the hole issue.
>
> I can find some link resets on this drive at the time it was used as a
> swap file. Maybe this could be a reason.

Yeah, if there is a link reset on the drive, the whole command queue is
lost. It could cause a bunch of I/O errors that look scary but are one-time
errors related to the link reset. So you really don't want the link resets
happening. Conversely, many applications get mad if there really is a
180-second hang while a consumer drive does deep recovery. So it's a
catch-22, unless your use case can tolerate it. But hopefully you only
rarely have bad sectors anyway. One nice thing about Btrfs is you can do a
balance and it causes everything to be written out, which itself
"refreshes" sector data with a stronger signal. You probably shouldn't have
to do that too often, maybe once every 12-18 months. Otherwise, too many
bad sectors is a valid warranty claim.

> I think I will remove the md layer and use only BTRFS to be able to
> recover from silent data corruption.

Btrfs on top of md will still repair metadata from data corruption if the
metadata profile is DUP. And in the case of (user) data corruption, it's
still not silent: Btrfs will tell you which file is corrupt and you can
recover it from a backup.

I can't tell you that Btrfs raid5 with a missing/failed drive is any more
reliable than md raid5. In a way it's simpler, so that might be to your
advantage; it really depends on your comfort and experience with the user
space tools. If you do want to move to strictly Btrfs, I suggest raid5 for
data but raid1 for metadata instead of raid5 (there's a concrete sketch
below, after the notes on those check commands). Metadata raid5 writes
can't really be assured to be atomic. Using raid1 metadata is less fragile.
No matter what, keep backups up to date, and always be prepared to have to
use them. The main idea of any raid is to just give you some extra uptime
in the face of a failure. And the uptime is for your applications.

> But I'm curious to be able to repair a broken BTRFS without moving the
> whole dataset to another place. It's the second time it has happened to
> me.
>
> I tried:
> # btrfs check --init-extent-tree /dev/md127
> # btrfs check --clear-space-cache v2 /dev/md127
> # btrfs check --clear-space-cache v1 /dev/md127
> # btrfs rescue super-recover /dev/md127
> # btrfs check -b --repair /dev/md127
> # btrfs check --repair /dev/md127
> # btrfs rescue zero-log /dev/md127

Wrong order. It's not obvious that it's the wrong order, either; the tools
don't do a great job of telling us what order to do things in. Also, all of
these involve writes. You really need to understand the problem first.

zero-log means some last-minute writes will be lost, and it should only be
used if there's difficulty mounting and the kernel errors point to a
problem with log replay.

clear-space-cache is safe; the cache is recreated at the next mount, so it
might result in a slow initial mount after use.

super-recover is safe by itself or with -v. It should be safe with -y, but
-y does write changes to disk.

--init-extent-tree is about the biggest hammer in the arsenal; it fixes
only a very specific problem with the extent tree, and usually it doesn't
help, it just makes things worse.

--repair should be safe, but even in the 4.20.1 tools you'll see the man
page says it's dangerous and you should ask on the list before using it.
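And to come back to the raid5 data / raid1 metadata suggestion above, a
rough sketch, only for a healthy filesystem (not this one) and with backups
current, since a convert balance rewrites everything. On a new filesystem
it's just the mkfs profiles; on an existing multi-device filesystem it's a
convert balance (the device names and mount point are placeholders):

  # mkfs.btrfs -d raid5 -m raid1 /dev/sdb /dev/sdc /dev/sdd
  # btrfs balance start -dconvert=raid5 -mconvert=raid1,soft /mnt

The 'soft' filter just means chunks already in the target profile are
skipped if you have to re-run the balance.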
> The detailed output is here [6]. But none of the above allowed me to drop
> the broken part of the btrfs tree to move forward. Is there a way to
> repair (by losing the corrupted data) without needing to drop all the
> correct data?

Well, at this point, since you ran all those commands, the file system is
different, so you should refresh the thread by posting the current kernel
messages from a normal mount (no options); and also 'btrfs check' output
without --repair; and also output from btrfs-debug-tree. If the problem is
simple enough and a dev has time, they might get you a file-system-specific
patch to apply and it can be fixed. But it's really important that you stop
making changes to the file system in the meantime. Just gather information.
Be deliberate.

-- 
Chris Murphy