Date: Sat, 2 May 2020 05:09:46 -0400
From: Zygo Blaxell
To: Phil Karn
Cc: Paul Jones, linux-btrfs@vger.kernel.org
Subject: Re: Extremely slow device removals
Message-ID: <20200502090946.GO10769@hungrycats.org>

On Sat, May 02, 2020 at 01:22:25AM -0700, Phil Karn wrote:
> On Sat, May 2, 2020 at 12:42 AM Zygo Blaxell wrote:
>
> > If you use btrfs replace to move data between drives then you get all
> > the advantages you describe. Don't do 'device remove' if you can
> > possibly avoid it.
>
> But I had to use remove to do what I originally wanted to do: replace
> four 6TB drives with two 16TB drives. I could replace two, but I'd
> still have to remove two more. I may give up on that latter part for
> now, but my original hope was to move everything to a smaller and
> especially quieter box than the 10-year-old 4U server I have now,
> which is banished to the garage because of the noise. (Working on its
> console in single-user mode is much less pleasant than retiring to the
> house and using my laptop.) I also wanted to retire all four 6TB
> drives because they have over 35K hours (four years) of continuous run
> time. They keep passing their SMART checks, but I didn't want to keep
> pushing my luck.

I replace drives in arrays one at a time, equally spaced over their
warranty period. The replacements are larger, and that requires
balances that run for 3-6 months. I guess the balance time will double
every 18 months, which means there will come a point where a balance
takes longer than simply waiting for the next replacement drive to make
a pair of disks with unallocated space. I don't want to change the
schedule to replace two drives at a time, as that increases the
probability of a correlated two-disk failure.
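To make the replace-based swap concrete, here is a rough sketch, wrapped
in Python purely for illustration; the device paths, mount point, and
devid below are placeholders, not anything from your setup:

    #!/usr/bin/env python3
    # Hypothetical sketch: swap one old drive for a larger one with
    # 'btrfs replace' instead of 'device remove' plus re-add.
    # Device names, mount point, and devid are placeholders.
    import subprocess

    MOUNT = "/mnt/pool"                        # placeholder mount point
    OLD_DEV, NEW_DEV = "/dev/sdb", "/dev/sdf"  # placeholder devices
    DEVID = "2"                                # devid taken over by the new drive

    def run(*args):
        # Run a btrfs subcommand in the foreground and fail loudly on error.
        subprocess.run(["btrfs", *args], check=True)

    # Copy the old device's chunks directly onto the new device.
    # -B keeps the replace in the foreground instead of backgrounding it.
    run("replace", "start", "-B", OLD_DEV, NEW_DEV, MOUNT)

    # After the replace, the new drive is still limited to the old drive's
    # size; resize it so the extra capacity becomes unallocated space.
    run("filesystem", "resize", f"{DEVID}:max", MOUNT)

The resize step at the end is what actually gives you a larger disk with
unallocated space for the allocator to use.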
> > If there's data corruption on one disk, btrfs can detect it and
> > replace the lost data from the good copy.
>
> That's a very good point I should have remembered. FS-agnostic RAID
> depends on drive-level error detection, and being an early TCP/IP guy
> I have always been a fan of end-to-end checks. That said, I can't
> remember EVER having one of my drives silently corrupt data.

Out of the ~120 drive models I've tested, I've seen only 5 spinning
drives that silently corrupt data. One disk got hot enough to emit blue
smoke; another didn't have the smoky drama but did have obvious bit
errors in its DRAM cache. The rest were drives with firmware bugs, so
all the instances of those specific models had identical issues.

On SD/MMC cards and below-$50 SSDs, silent data corruption is the most
common failure mode. I don't think those devices are capable of
detecting or reporting individual sector errors; I've never seen one do
it. They either fall off the bus or fail catastrophically and return an
error on every single access.

Some drive-level error events leave scars that look like data
corruption to btrfs, e.g. if the firmware crashes before it can empty
its write cache, or if the Linux command timeout is set too low and the
kernel resets the drive before it completes a write. That's so common
on low-end desktop drives that I stopped buying them (at least the
cheap SSDs weren't slow).

> When one failed, I knew it. (Boy, did I know it.) I can detect silent
> corruption even in my ext4 or xfs file systems because I've been
> experimenting for years with stashing SHA file hashes in an extended
> attribute and periodically verifying them. This originated as a simple
> deduplication tool with the attributes used only as a cache, but I
> became intrigued by other uses for file-level hashes, like looking for
> a file on a heterogeneous collection of machines by multicasting its
> hash, and the aforementioned check for silent corruption. (Yes, I know
> btrfs checks automatically, but I won't represent what I'm doing as
> anything but purely experimental.)

Experiment away! The more redundant hashes, the better. I found two
btrfs data corruption bugs that way, and the same data makes me
confident that there aren't any more (at least with my current
application workload).

> I've never seen a btrfs scrub produce errors either, except very
> quickly on one system with faulty RAM, so I was never going to trust
> it with real data anyway. (BTW, I believe strongly in ECC RAM. I can't
> understand why it isn't universal given that it costs little more.)

I've seen one scrub error in a month of testing on a machine with known
bad RAM; btrfs had unrecoverable corruption 3 times in the same
interval.

> I'm beginning to think I should look at some of the less tightly
> coupled ways to provide redundant storage, such as gluster.
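Coming back to your hash-in-an-xattr experiment: purely as an
illustration of the idea (not your tool; the attribute name and the
choice of SHA-256 are my own assumptions), a minimal stash/verify cycle
in Python could look like this:

    #!/usr/bin/env python3
    # Minimal sketch: cache a file's SHA-256 in an extended attribute and
    # compare it against the file contents on a later pass.
    # Attribute name and hash algorithm are assumptions for illustration.
    import hashlib
    import os
    import sys

    XATTR = "user.sha256"   # hypothetical attribute name

    def file_sha256(path):
        # Hash the file in 1 MiB chunks and return the hex digest as bytes.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest().encode()

    def stash(path):
        # Cache the current hash; also usable as a dedup key.
        os.setxattr(path, XATTR, file_sha256(path))

    def verify(path):
        # Returns None if no cached hash, else True/False for a match.
        try:
            stored = os.getxattr(path, XATTR)
        except OSError:
            return None
        return stored == file_sha256(path)

    if __name__ == "__main__":
        mode, files = sys.argv[1], sys.argv[2:]
        for p in files:
            if mode == "stash":
                stash(p)
            elif verify(p) is False:
                print("MISMATCH:", p)

A real tool would also record the file's size and mtime alongside the
hash, so a legitimate edit isn't reported as corruption.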