Subject: Re: using raid56 on a private machine
From: Goffredo Baroncelli
Reply-To: kreijack@inwind.it
To: Zygo Blaxell
Cc: cryptearth, linux-btrfs@vger.kernel.org, Josef Bacik
Date: Tue, 6 Oct 2020 19:12:04 +0200
In-Reply-To: <20201006012427.GD21815@hungrycats.org>

On 10/6/20 3:24 AM, Zygo Blaxell wrote:
> On Mon, Oct 05, 2020 at 07:57:51PM +0200, Goffredo Baroncelli wrote:
[...]
>>
>> I have only a few suggestions:
>> 1) Don't store valuable data on BTRFS with the raid5/6 profile. Use
>> it if you want to experiment and want to help the development of
>> BTRFS, but be ready to face the loss of all data. (Very unlikely; but
>> the bigger the filesystem, the more difficult a restore of the data
>> becomes in case of problems.)
>
> Losing all of the data seems unlikely given the bugs that exist so far.
> The known issues are related to availability (it crashes a lot and
> isn't fully usable in degraded mode) and small amounts of data loss
> (like 5 blocks per TB).

From what I read on the mailing list, when a problem is too complex to
solve, to the point that the filesystem has to be recreated, quite
often the main issue is not "extracting" the data, but the availability
of enough additional space to "store" the data.

> The above assumes you never use raid5 or raid6 for btrfs metadata. Using
> raid5 or raid6 for metadata can result in total loss of the filesystem,
> but you can use raid1 or raid1c3 for metadata instead.
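To make that concrete, a minimal sketch of both routes (the device
names and the mount point are placeholders; raid1c3 requires kernel
5.5 or newer and a matching btrfs-progs):

   # new filesystem: raid6 for data, raid1c3 for metadata
   mkfs.btrfs -d raid6 -m raid1c3 /dev/sda /dev/sdb /dev/sdc /dev/sdd

   # or convert only the metadata of an existing filesystem mounted at
   # /mnt; this is a one-time profile conversion, not the periodic
   # metadata balance that is discouraged later in this thread
   btrfs balance start -mconvert=raid1c3 /mnt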
>> 2) Don't fill the filesystem more than 70-80%. If you go beyond this
>> limit, the likelihood of hitting the "dark and scary corners"
>> quickly increases.
>
> Can you elaborate on that? I run a lot of btrfs filesystems at 99%
> capacity, some of the bigger ones even higher. If there were issues at
> 80% I expect I would have noticed them. There were some performance
> issues with full filesystems on kernels using space_cache=v1, but
> space_cache=v2 went upstream 4 years ago, and other significant
> performance problems a year before that.

My suggestion was more about having enough space not to stress the
filesystem than "if you go beyond this limit you will have problems".
A BTRFS problem that confuses users is that you can have free space but
still be unable to allocate a new metadata chunk. See
https://lore.kernel.org/linux-btrfs/6e6565b2-58c6-c8c1-62d0-6e8357e41a42@gmx.com/T/#t

Having the filesystem filled to 99% means that you have to check the
filesystem carefully (and balance it) to avoid scenarios like this. On
the other hand, 1% of 1TB (a small filesystem by today's standards) is
about 10GB, which everybody should consider enough...

> The last few GB is a bit of a performance disaster and there are
> some other gotchas, but that's an absolute number, not a percentage.

True, it is sufficient to have a few GB free (i.e. not allocated to any
chunk) on *enough* disks... However, these requirements are a bit
difficult for new BTRFS users to understand.

> Never balance metadata. That is a ticket to a dark and scary corner.
> Make sure you don't do it, and that you don't accidentally install a
> cron job that does it.
>
>> 3) Run scrub periodically and after a power failure; better yet, use
>> an uninterruptible power supply (this is true for all RAID
>> implementations, even the MD one).
>
> scrub also provides early warning of disk failure, and detects disks
> that are silently corrupting your data. It should be run not less than
> once a month, though you can skip months where you've already run a
> full-filesystem read for other reasons (e.g. replacing a failed disk).
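For example (the mount point is a placeholder, and the path of the
btrfs binary varies between distributions):

   # run a scrub by hand and check on its progress later
   btrfs scrub start /mnt
   btrfs scrub status /mnt

   # a possible monthly entry in /etc/cron.d/btrfs-scrub; -B keeps the
   # scrub in the foreground so cron can mail the result, and -d prints
   # per-device statistics
   0 3 1 * * root /usr/bin/btrfs scrub start -Bd /mnt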
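And coming back to the chunk-allocation issue above, a sketch of how
to check the allocation and compact only the *data* chunks, keeping in
mind the "never balance metadata" advice (again, /mnt is a
placeholder):

   # show how much space is allocated to chunks vs. actually used
   btrfs filesystem usage /mnt

   # rewrite only data chunks that are less than 50% full; the freed
   # space becomes unallocated and remains available for future
   # metadata chunks; deliberately no -m/-musage filter here
   btrfs balance start -dusage=50 /mnt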
>> 4) I don't have any data to support this, but as an occasional
>> reader of this mailing list I have the feeling that combining BTRFS
>> with LUKS (or bcache) raises the likelihood of a problem.

> I haven't seen that correlation. All of my machines run at least one
> btrfs on luks (dm-crypt). The larger ones use lvmcache. I've also run
> bcache on test machines doing power-fail tests.
>
> That said, there are additional hardware failure risks involved in
> caching (more storage hardware components = more failures) and the
> system must be designed to tolerate and recover from these failures.
>
> When cache disks fail, just uncache and run scrub to repair. btrfs
> checksums will validate the data on the backing HDD (which will be badly
> corrupted after a cache SSD failure) and will restore missing data from
> other drives in the array.
>
> It's definitely possible to configure bcache or lvmcache incorrectly,
> and then you will have severe problems. Each HDD must have a separate
> dedicated SSD. No sharing between cache devices is permitted. They must
> use separate cache pools. If one SSD is used to cache two or more HDDs
> and the SSD fails, it will behave the same as a multi-disk failure and
> probably destroy the filesystem. So don't do that.
>
> Note that firmware in the SSDs used for caching must respect write
> ordering, or the cache will do severe damage to the filesystem on
> just about every power failure. It's a good idea to test hardware
> in a separate system through a few power failures under load before
> deploying them in production. Most devices are OK, but a few percent
> of models out there have problems so severe they'll damage a filesystem
> in a single-digit number of power loss events. It's fairly common to
> encounter users who have lost a btrfs on their first or second power
> failure with a problematic drive. If you're stuck with one of these
> disks, you can disable write caching and still use it, but there will
> be added write latency, and in the long run it's better to upgrade to
> a better disk model.
>
>> 5) Pay attention that an 8-disk RAID raises the likelihood of a
>> failure by about an order of magnitude compared to a single disk!
>> RAID6 (or any other RAID) mitigates that, in the sense that it
>> creates a time window in which it is possible to perform maintenance
>> (e.g. a disk replacement) before data is lost.
>> 6) Leave room in the disk array for an additional disk (to use when
>> a disk replacement is needed).
>> 7) Avoid USB disks, because they are not reliable.
>>
>>
>>>
>>> Any information appreciated.
>>>
>>>
>>> Greetings from Germany,
>>>
>>> Matt

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5