From: "John Stoffel"
To: antlists
Cc: Brian Allen Vanderburg II, linux-raid@vger.kernel.org
Subject: Re: Linux raid-like idea
Date: Sat, 12 Sep 2020 13:28:30 -0400
Message-ID: <24413.1342.676749.275674@quad.stoffel.home>
In-Reply-To: <9220ea81-3a81-bb98-22e3-be1a123113a1@youngman.org.uk>
References: <1cf0d18c-2f63-6bca-9884-9544b0e7c54e.ref@aim.com>
 <1cf0d18c-2f63-6bca-9884-9544b0e7c54e@aim.com>
 <5F54146F.40808@youngman.org.uk>
 <274cb804-9cf1-f56c-9ee4-56463f052c09@aim.com>
 <38f9595b-963e-b1f5-3c29-ad8981e677a7@aim.com>
 <9220ea81-3a81-bb98-22e3-be1a123113a1@youngman.org.uk>
X-Mailing-List: linux-raid@vger.kernel.org

>>>>> "antlists" == antlists writes:

antlists> On 11/09/2020 21:14, Brian Allen Vanderburg II wrote:
>> That's right, I get the various combinations confused.  So does
>> raid61 allow for losing 4 disks in any order and still recovering?
>> Or would some order of disks make it so that losing just 3 disks is
>> bad?  Interesting nonetheless, and I'll have to look into it.
>> Obviously it's not intended as a replacement for backing up
>> important data, but, for me anyway, just a way to minimize loss of
>> any trivial bulk data/files.

antlists> Yup. Raid 6 has two parity disks, and that's mirrored to
antlists> give four parity disks. So as an *absolute* *minimum*,
antlists> raid-61 could lose four disks with no data loss.

antlists> Throw in the guarantee that, with a mirror, you can lose an
antlists> entire mirror with no data loss, that means - with luck and
antlists> a following wind - you could lose half your disks, PLUS the
antlists> two parities in the remaining disks, and still recover your
antlists> data. So with a raid-6+1, if I had twelve disks, I could
antlists> lose EIGHT disks and still have a *chance* of recovering my
antlists> array. I'm not quite sure what difference raid-61 would
antlists> make.
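I got curious and brute-forced that claim. Here's a toy sketch
(Python, purely illustrative - it assumes the twelve disks split into
two mirrored six-disk RAID-6 halves, which may not match how a real
raid-61 stripes things):

    from itertools import combinations

    # Disks 0-5 form one six-disk RAID-6 half, disks 6-11 the other,
    # and the two halves mirror each other.  Data survives if either
    # half has lost at most 2 disks (RAID-6 tolerates any 2 failures).
    def survives(failed):
        a = sum(1 for d in failed if d < 6)
        return a <= 2 or (len(failed) - a) <= 2

    for k in range(13):
        results = [survives(f) for f in combinations(range(12), k)]
        verdict = ("always survives" if all(results)
                   else "may survive" if any(results)
                   else "always lost")
        print(f"{k:2d} failures: {verdict}")

Under that layout, losing any five disks is survivable in any order,
and with luck you can lose eight - which matches your numbers.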
Of course your useful storage is 12/2 - 2 = 4 disks, so only 33%
usable space. Not very good. At that point, I'd just go with RAID 1
pairs striped together with RAID 0 (RAID 1+0) for only a 50% loss of
space. Now if I lose the wrong four disks I'm screwed, as opposed to
before, where I can lose *any* four disks.

The problem with RAID6 is that random small IO writes have terrible
performance; RAID 1+0 gives you much better performance. Ars Technica
had a great article on this earlier in the year that did actual
testing.

I think (and haven't done the math) that erasure coding gives you
better protection as you scale. I've even thought about glusterfs or
cephfs for home storage, using a bunch of small single-board
computers each talking to one disk. But... it's hard to justify.

These days, I think it's better, performance-wise, for your main
storage to be a three-way mirror for the important stuff, and RAID6
with a hot spare for your large streaming collections like videos and
such. But even then it's hard to justify, since it costs a lot.

antlists> (That says to me, if I have a raid-61, I need as a minimum
antlists> a complete set of data disks. That also says to me, if I've
antlists> splatted an 8+2 raid-61 across 11 disks, I only need 7 for
antlists> a full recovery despite needing a minimum of 8, so
antlists> something isn't quite right here... I suspect the 7 would
antlists> be enough but I did say my mind goes Whooaaa!!!!)

>> It would be nice if the raid modules had support for methods that
>> could support a total of more disks in any order lost without
>> losing data.  Snapraid source states that it uses some Cauchy
>> Matrix algorithm which in theory could lose up to 6 disks if using
>> 6 parity disks, in any order, and still be able to restore the
>> data.  I'm not familiar with the math behind it so can't speak to
>> the accuracy of that claim.

antlists> That's easy, it's just whether it's worth it. Look at the
antlists> maths behind raid-6. The "one parity disk" methods, 4 or 5,
antlists> just use XOR. But that only works once: a second XOR parity
antlists> disk adds no new redundancy and is worthless. I'm guessing
antlists> raid-6 uses that Cauchy method you talk about - certainly
antlists> it can generate as many parity disks as you like ... so
antlists> that claim is good, even if raid-6 doesn't use that
antlists> particular technique.

antlists> If someone wants to, mod'ing raid-6 to use 3 parity disks
antlists> shouldn't be that hard ...

It's not, but there are diminishing returns, because you now have to
do the RMW cycle across even more disks, which is slow.

antlists> But going back to your original idea, I've been thinking
antlists> about it. And it struck me - you NEED to regenerate parity
antlists> EVERY TIME you write data to disk! Otherwise, writing one
antlists> file on one disk instantly trashes your ability to recover
antlists> all the other files in the same position on the other
antlists> disks. WHOOPS! But if you think it's a good idea, by all
antlists> means try and do it.

Correct. When you compute parity, you do it across blocks. And the
parity calculation is effectively free these days.
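To make that concrete, here's a toy sketch of XOR parity over one
stripe (Python, purely illustrative - nothing like the real md code
paths):

    from functools import reduce

    def xor_blocks(*blocks):
        # Byte-wise XOR of equal-sized blocks.
        return bytes(reduce(lambda x, y: x ^ y, col)
                     for col in zip(*blocks))

    d0, d1, d2 = (bytes([i] * 4) for i in (1, 2, 3))  # one stripe's data
    p = xor_blocks(d0, d1, d2)                        # its parity block

    # Lose any one block and the survivors plus parity rebuild it:
    assert xor_blocks(d0, d2, p) == d1

    # A second XOR parity disk would hold exactly the same bytes as p,
    # adding no redundancy - hence raid-6's second parity uses
    # different math (a Reed-Solomon syndrome over GF(2^8) in md).

    # The small-write update is p' = p ^ d_old ^ d_new.  You still
    # have to read the old data and old parity first - the RMW cycle.
    d1_new = bytes([9] * 4)
    p_new = xor_blocks(p, d1, d1_new)
    assert p_new == xor_blocks(d0, d1_new, d2)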
The cost comes from the (on spinning disks, at least) rotational
latency to read the entire stripe across all the disks, modify one to
N bytes in that stripe, then re-write the stripe back to all the
disks. That's a lot of IO. With RAID1, you just make two writes, one
to each disk. Done. Even with a three-way mirror, it's simpler.

Now, RAID6 works better if you're replacing the entire stripe: then
you can cut your IOs in half, but you still need to write chunks to
different disks.

This is why the big vendors (NetApp, EMC, Isilon, etc.) run log-based
filesystems with battery-backed RAM caches, so they can A) tell the
client the writes are done, B) collect changes into bigger chunks,
and C) write them down to disk in linear fashion. Log-based
filesystems are great for this. Until they get fragmented.

SSDs help in that they really don't have a seek cost at all, so you
can handle fragmentation better. BUT! SSDs are generally written to
assuming 512-byte blocks, while the underlying NAND flash now
generally uses 4k blocks, so there's another layer of fragmentation
and wear levelling and other stuff happening outside your control
there as well.

antlists> The other thing I'd suggest here is try and make it more
antlists> like raid-5 than raid-4. You have X disks, let's say 5. So
antlists> the disks are numbered 0, 1, 2, 3, 4. As part of formatting
antlists> each disk ready for raid, you create a file containing
antlists> every block where LBA mod 5 equals the disk number. So as
antlists> you recalculate your parities, that's where they go.

RAID4 suffers from the parity disk becoming a super hot spot, since
it needs to get written to no matter what. No one uses it. (See the
P.S. below for a quick sketch of that rotation.)

Until we can get back to cost-effective SSDs using SLC NAND, RAID is
here to stay. And so is mirroring, since it does help protect from a
lot of issues, both permanent and temporary. I had one of my 4tb
disks fall out of my main VG, but I didn't lose any data; I just
checked the disk and added it back in. I've got a new 4tb disk on
order, along with a drive cage, so I can balance things better. But
it's almost to the point where it's cheaper to buy a pair of 8tb
drives to replace the 4x4tb drives I'm using now. But I probably
won't.

I could write for hours here... it's a tough problem space to work
through.

John
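P.S. For the raid-5-style parity rotation above, the placement rule
is just modular arithmetic. A toy sketch (again illustrative only -
md's real rotating layouts differ in the details):

    # Stripe N puts its parity on disk N mod ndisks, so the parity
    # writes rotate instead of hammering one raid-4-style hot spot.
    ndisks = 5
    for stripe in range(10):
        parity_disk = stripe % ndisks
        data_disks = [d for d in range(ndisks) if d != parity_disk]
        print(f"stripe {stripe:2d}: parity on disk {parity_disk}, "
              f"data on {data_disks}")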