* Re: Suggestions for building new 44TB Raid5 array
@ 2022-06-11  4:51 Marc MERLIN
  2022-06-11  9:30 ` Roman Mamedov
                   ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Marc MERLIN @ 2022-06-11  4:51 UTC (permalink / raw)
  To: Andrei Borzenkov
  Cc: Zygo Blaxell, Josef Bacik, linux-btrfs, Chris Murphy, Qu Wenruo

so, my apologies to all for the thread of death that is hopefully going
to be over soon. I still want to help Josef fix the tools though,
hopefully we'll get that filesystem back to a mountable state.

That said, it's been over 2 months now, and I do need to get this
filesystem back up from backup, so I ended up buying new drives (5x
11TiB in raid5).

Given the pretty massive corruption that happened in ways that I still
can't explain, I'll make sure to turn off all the drive write caches,
but I'm not sure I want to trust bcache anymore, even though I had it
in writethrough mode.

Here's the Email from March, questions still apply:

Kernel will be 5.16. Filesystem will be 24TB and contain mostly bigger
files (100MB to 10GB).

1) mdadm --create /dev/md7 --level=5 --consistency-policy=ppl --raid-devices=5 /dev/sd[abdef]1 --chunk=256 --bitmap=internal
2) echo 0fb96f02-d8da-45ce-aba7-070a1a8420e3 >  /sys/block/bcache64/bcache/attach 
   gargamel:/dev# cat /sys/block/md7/bcache/cache_mode
   [writethrough] writeback writearound none
3) cryptsetup luksFormat --align-payload=2048 -s 256 -c aes-xts-plain64  /dev/bcache64
4) cryptsetup luksOpen /dev/bcache64 dshelf1
5) mkfs.btrfs -m dup -L dshelf1 /dev/mapper/dshelf1

Any other btrfs options I should set for format to improve reliability
first and performance second?
I'm told I should use space_cache=v2; is it the default now with btrfs-progs 5.10.1-2?

As for bcache, I'm really thinking about dropping it, unless I'm told
it should be safe to use.

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
 
Home page: http://marc.merlins.org/  


* Re: Suggestions for building new 44TB Raid5 array
  2022-06-11  4:51 Suggestions for building new 44TB Raid5 array Marc MERLIN
@ 2022-06-11  9:30 ` Roman Mamedov
       [not found]   ` <CAK-xaQYc1PufsvksqP77HMe4ZVTkWuRDn2C3P-iMTQzrbQPLGQ@mail.gmail.com>
  2022-06-11 23:44 ` Zygo Blaxell
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 24+ messages in thread
From: Roman Mamedov @ 2022-06-11  9:30 UTC (permalink / raw)
  To: Marc MERLIN
  Cc: Andrei Borzenkov, Zygo Blaxell, Josef Bacik, linux-btrfs,
	Chris Murphy, Qu Wenruo

On Fri, 10 Jun 2022 21:51:20 -0700
Marc MERLIN <marc@merlins.org> wrote:

> Kernel will be 5.16. Filesystem will be 24TB and contain mostly bigger
> files (100MB to 10GB).

> 2) echo 0fb96f02-d8da-45ce-aba7-070a1a8420e3 >  /sys/block/bcache64/bcache/attach 
>    gargamel:/dev# cat /sys/block/md7/bcache/cache_mode
>    [writethrough] writeback writearound none

Maybe try LVM Cache this time?

> 3) cryptsetup luksFormat --align-payload=2048 -s 256 -c aes-xts-plain64  /dev/bcache64
> 4) cryptsetup luksOpen /dev/bcache64 dshelf1

What's the threat scenario for LUKS on the array?

A major one for me would be not having to RMA a disk with all my data
still on the platters. But with RAID5, a single disk by itself would not
contain easily discernible or usable data. Or if you're protecting against
unauthorized access to the entire array, then never mind.

> 5) mkfs.btrfs -m dup -L dshelf1 /dev/mapper/dshelf1

Personally I have switched from Btrfs on MD to individual disks and MergerFS.

The rationale for no RAID is the simplicity and resilience of the individual
single-disk filesystems, and that anything important or not easily
re-obtainable is backed up anyway; so the protection from single-disk
failures is not as important, compared to the introduced complexity and the
chance of losing the entire huge FS (as happened to you).
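
For illustration, pooling a few single-disk filesystems with MergerFS
can be a single fstab line (paths and the create policy here are just
an example, not a prescription):

	# pool /mnt/disk1..N into /mnt/pool; new files go to the branch
	# with the most free space (category.create=mfs)
	/mnt/disk* /mnt/pool fuse.mergerfs defaults,allow_other,use_ino,category.create=mfs 0 0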

-- 
With respect,
Roman


* Re: Suggestions for building new 44TB Raid5 array
       [not found]   ` <CAK-xaQYc1PufsvksqP77HMe4ZVTkWuRDn2C3P-iMTQzrbQPLGQ@mail.gmail.com>
@ 2022-06-11 14:52     ` Marc MERLIN
  2022-06-11 17:54       ` Roman Mamedov
                         ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Marc MERLIN @ 2022-06-11 14:52 UTC (permalink / raw)
  To: Andrea Gelmini, Roman Mamedov
  Cc: Andrei Borzenkov, Zygo Blaxell, Josef Bacik, Chris Murphy,
	Qu Wenruo, linux-btrfs

On Sat, Jun 11, 2022 at 09:27:57AM +0200, Andrea Gelmini wrote:
> Il giorno sab 11 giu 2022 alle ore 09:16 Marc MERLIN
> <marc@merlins.org> ha scritto:
> > As for bcache, I'm really thinking about dropping it, unless I'm told
> > it should be safe to use.
> 
> https://lwn.net/Articles/895266/

Mmmh, bcachefs, I was not aware of this new one. Not sure if I want to
add yet another layer, especially if it's new.

On Sat, Jun 11, 2022 at 02:30:33PM +0500, Roman Mamedov wrote:
> > 2) echo 0fb96f02-d8da-45ce-aba7-070a1a8420e3 >  /sys/block/bcache64/bcache/attach 
> >    gargamel:/dev# cat /sys/block/md7/bcache/cache_mode
> >    [writethrough] writeback writearound none
> 
> Maybe try LVM Cache this time?
 
Hard to know either way; trading one layer for another, and LVM has
always seemed heavier.

> > 3) cryptsetup luksFormat --align-payload=2048 -s 256 -c aes-xts-plain64  /dev/bcache64
> > 4) cryptsetup luksOpen /dev/bcache64 dshelf1
> 
> What's the threat scenario for LUKS on the array?

In case my computer gets stolen; and indeed, being able to recycle old
drives without having to worry is a nice bonus.

> > 5) mkfs.btrfs -m dup -L dshelf1 /dev/mapper/dshelf1
> 
> Personally I have switched from Btrfs on MD to individual disks and MergerFS.
 
That gives you no redundancy if a drive dies, correct?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
 
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08


* Re: Suggestions for building new 44TB Raid5 array
  2022-06-11 14:52     ` Marc MERLIN
@ 2022-06-11 17:54       ` Roman Mamedov
  2022-06-12 17:31         ` Marc MERLIN
  2022-06-12 21:21       ` Roman Mamedov
  2022-06-20 20:37       ` Andrea Gelmini
  2 siblings, 1 reply; 24+ messages in thread
From: Roman Mamedov @ 2022-06-11 17:54 UTC (permalink / raw)
  To: Marc MERLIN
  Cc: Andrea Gelmini, Andrei Borzenkov, Zygo Blaxell, Josef Bacik,
	Chris Murphy, Qu Wenruo, linux-btrfs

On Sat, 11 Jun 2022 07:52:59 -0700
Marc MERLIN <marc@merlins.org> wrote:

> 1) mdadm --create /dev/md7 --level=5 --consistency-policy=ppl
> --raid-devices=5 /dev/sd[abdef]1 --chunk=256 --bitmap=internal

One more thing I wanted to mention: did you have PPL on your previous array?
Or was it not implemented yet back then? I know it is supposed to protect
against the write hole, which could have caused your previous FS corruption.

> > > 5) mkfs.btrfs -m dup -L dshelf1 /dev/mapper/dshelf1
> > 
> > Personally I have switched from Btrfs on MD to individual disks and MergerFS.
>  
> That gives you no redundancy if a drive dies, correct?

Yes, but in MergerFS each file is stored entirely on a single disk;
there's no striping. So only files that happened to be on the failed
disk are lost and need to be restored from backups. For this it helps
to keep track of what was where, with something like
"find /mnt/ > `date`.lst" in crontab.
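
As a concrete sketch (path and schedule made up; note that % must be
escaped in crontab):

	# nightly listing of which file lives on which disk
	30 3 * * *  find /mnt/ > /root/contents-$(date +\%F).lst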

-- 
With respect,
Roman


* Re: Suggestions for building new 44TB Raid5 array
  2022-06-11  4:51 Suggestions for building new 44TB Raid5 array Marc MERLIN
  2022-06-11  9:30 ` Roman Mamedov
@ 2022-06-11 23:44 ` Zygo Blaxell
  2022-06-14 11:03 ` ronnie sahlberg
       [not found] ` <5e1733e6-471e-e7cb-9588-3280e659bfc2@aqueos.com>
  3 siblings, 0 replies; 24+ messages in thread
From: Zygo Blaxell @ 2022-06-11 23:44 UTC (permalink / raw)
  To: Marc MERLIN
  Cc: Andrei Borzenkov, Josef Bacik, linux-btrfs, Chris Murphy, Qu Wenruo

On Fri, Jun 10, 2022 at 09:51:20PM -0700, Marc MERLIN wrote:
> so, my apologies to all for the thread of death that is hopefully going
> to be over soon. I still want to help Josef fix the tools though,
> hopefully we'll get that filesystem back to a mountable state.
> 
> That said, it's been over 2 months now, and I do need to get this
> filesystem back up from backup, so I ended up buying new drives (5x
> 11TiB in raid5).
> 
> Given the pretty massive corruption that happened in ways that I still 
> can't explain, I'll make sure to turn off all the drive write caches
> but I think I'm not sure I want to trust bcache anymore even though
> I had it in writethrough mode.
> 
> Here's the Email from March, questions still apply:
> 
> Kernel will be 5.16. Filesystem will be 24TB and contain mostly bigger
> files (100MB to 10GB).
> 
> 1) mdadm --create /dev/md7 --level=5 --consistency-policy=ppl --raid-devices=5 /dev/sd[abdef]1 --chunk=256 --bitmap=internal
> 2) echo 0fb96f02-d8da-45ce-aba7-070a1a8420e3 >  /sys/block/bcache64/bcache/attach 
>    gargamel:/dev# cat /sys/block/md7/bcache/cache_mode
>    [writethrough] writeback writearound none
> 3) cryptsetup luksFormat --align-payload=2048 -s 256 -c aes-xts-plain64  /dev/bcache64
> 4) cryptsetup luksOpen /dev/bcache64 dshelf1
> 5) mkfs.btrfs -m dup -L dshelf1 /dev/mapper/dshelf1
> 
> Any other btrfs options I should set for format to improve reliability
> first and performance second?

> I'm told I should use space_cache=v2, is it default now with btrfs-progs 5.10.1-2 ?

It's default with current btrfs-progs.  I'm not sure what the cutoff
version is, but it doesn't matter--you can convert to v2 on first mount,
which will be fast on an empty filesystem.
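
If it turns out not to be the default in your btrfs-progs, the
conversion is just a one-time mount option (sketch; the mountpoint is a
placeholder, the device is from your plan above):

	mount -o space_cache=v2 /dev/mapper/dshelf1 /mnt/dshelf1
	# the free space tree persists; later mounts need no special option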

> As for bcache, I'm really thinking about dropping it, unless I'm told
> it should be safe to use.

I would not recommend the cache in this configuration for resilience
because it doesn't keep device failures in separate failure domains.
Common SSD failure modes (e.g.  silent data corruption, dropped writes)
can be detected but not repaired, and can affect any part of the
filesystem when viewed through the cache.

Unfortunately cache is only resilient with btrfs raid1 using SSD+HDD
cached device pairs so that a failure of any SSD or HDD affects at most
one btrfs device.  That configuration works reasonably well, but you'll
need a pile more disks (both HDD and SSD) to match the capacity.
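
As a rough sketch of that layout (device names are placeholders): give
each HDD its own dedicated cache SSD, then let btrfs raid1 run across
the cached devices, so a single SSD or HDD failure hits at most one
btrfs device:

	# sdb/sdc are HDDs, sdd/sde their dedicated cache SSDs
	make-bcache -B /dev/sdb -C /dev/sdd
	make-bcache -B /dev/sdc -C /dev/sde
	mkfs.btrfs -d raid1 -m raid1 /dev/bcache0 /dev/bcache1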

btrfs raid5 of SSD+HDD devices doesn't work--it will keep all IO accesses
below the cache's sequential IO size cutoff, which will wear out the SSDs
too fast (in addition to the other btrfs raid5 problems).  Same problem
with raid10 or raid0.

I've tested btrfs with both bcache and lvmcache.  I mostly use lvmcache,
and have had no problems with it.  bcache had problems in testing, so
I've never used bcache outside of test environments.

bcache has a few sharp edges when SSD devices fail that prevent
recovery with the filesystem still online.  It seems to trigger
service-interrupting firmware bugs in some SSD models with
its access patterns compared to lvmcache (failures that are
common on one vendor/model/firmware that never happen on any other
vendor/model/firmware, and that occur much more often, or at all, when
bcache is in use compared to when bcache is not in use).

I have not lost data with bcache when SSD corruption is not present--it
survived hundreds of power-fail crash test cycles and came back after
all the SSD firmware crashes in testing--but the service interruptions
from crashing firmware and the inability to recover from a failed drive
while keeping the filesystem online were a problem.  We worked around
this by using lvmcache instead.

If your IO subsystem has problems with write dropping, then it's going
to be much worse with any cache.  Neither bcache nor lvmcache have
any sort of hardening against SSD corruption or failure.  They both
fail badly on SSD corruption tests even in writethrough mode.

> Thanks,
> Marc
> -- 
> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
>  
> Home page: http://marc.merlins.org/  
> 


* Re: Suggestions for building new 44TB Raid5 array
  2022-06-11 17:54       ` Roman Mamedov
@ 2022-06-12 17:31         ` Marc MERLIN
  0 siblings, 0 replies; 24+ messages in thread
From: Marc MERLIN @ 2022-06-12 17:31 UTC (permalink / raw)
  To: Roman Mamedov
  Cc: Andrea Gelmini, Andrei Borzenkov, Zygo Blaxell, Josef Bacik,
	Chris Murphy, Qu Wenruo, linux-btrfs

On Sat, Jun 11, 2022 at 10:54:16PM +0500, Roman Mamedov wrote:
> On Sat, 11 Jun 2022 07:52:59 -0700
> Marc MERLIN <marc@merlins.org> wrote:
> 
> > 1) mdadm --create /dev/md7 --level=5 --consistency-policy=ppl
> > --raid-devices=5 /dev/sd[abdef]1 --chunk=256 --bitmap=internal
> 
> One more thing I wanted to mention, did you have PPL on your previous array?
> Or it was not implemented yet back then? I know it is supposed to protect
> against the write hole, which could have caused your previous FS corruption.
 
Looks like I had an internal bitmap instead:
gargamel:~# mdadm --query --detail  /dev/md7
/dev/md7:
           Version : 1.2
     Creation Time : Sun Feb 11 20:38:30 2018
        Raid Level : raid5
        Array Size : 23441561600 (22355.62 GiB 24004.16 GB)
     Used Dev Size : 5860390400 (5588.90 GiB 6001.04 GB)
      Raid Devices : 5
     Total Devices : 5
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Fri Jun 10 12:09:08 2022
             State : clean 
    Active Devices : 5
   Working Devices : 5
    Failed Devices : 0
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : bitmap

I'll switch to PPL instead, thanks for that. Actually I need to migrate
my other raid5 arrays to that too. It looks like it can be done at runtime.
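
Presumably something like this on the existing arrays (untested sketch;
PPL and the internal bitmap are mutually exclusive, so the bitmap has
to be dropped first):

	mdadm --grow --bitmap=none /dev/mdX
	mdadm --grow --consistency-policy=ppl /dev/mdX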

> Yes, but in MergerFS each file is stored entirely within a single disk,
> there's no striping. So only files which happened to be on the failed disk are
> lost and need to be restored from backups. For this it helps to keep track of
> what was where, with something like "find /mnt/ > `date`.lst" in crontab.

Right, I figured. It's not bad, but I do want no data loss if I lose a
drive, so I'll take raid5.
I realize that filesystem-aware raid5, like the raid5 in btrfs (which
I'm still not sure I can really trust), could lay out the files one per
disk without striping.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
 
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08


* Re: Suggestions for building new 44TB Raid5 array
  2022-06-11 14:52     ` Marc MERLIN
  2022-06-11 17:54       ` Roman Mamedov
@ 2022-06-12 21:21       ` Roman Mamedov
  2022-06-13 17:46         ` Marc MERLIN
  2022-06-13 18:13         ` Marc MERLIN
  2022-06-20 20:37       ` Andrea Gelmini
  2 siblings, 2 replies; 24+ messages in thread
From: Roman Mamedov @ 2022-06-12 21:21 UTC (permalink / raw)
  To: Marc MERLIN
  Cc: Andrea Gelmini, Andrei Borzenkov, Zygo Blaxell, Josef Bacik,
	Chris Murphy, Qu Wenruo, linux-btrfs

On Sat, 11 Jun 2022 07:52:59 -0700
Marc MERLIN <marc@merlins.org> wrote:

> On Sat, Jun 11, 2022 at 02:30:33PM +0500, Roman Mamedov wrote:
> > > 2) echo 0fb96f02-d8da-45ce-aba7-070a1a8420e3 >  /sys/block/bcache64/bcache/attach 
> > >    gargamel:/dev# cat /sys/block/md7/bcache/cache_mode
> > >    [writethrough] writeback writearound none
> > 
> > Maybe try LVM Cache this time?
>  
> Hard to know either way, trading one layer for another, and LVM has
> always seemed heavier

I'd suggest putting the LUKS volume onto an LV anyway (in case you weren't
planning to), so you can add and remove a cache just to see how it works;
unlike with bcache, an LVM cache can be added to an existing LV and then
removed without a trace, all without having to displace 44 TB of data for that.

And plain no-cache LVM doesn't add much in terms of being a "layer".

-- 
With respect,
Roman


* Re: Suggestions for building new 44TB Raid5 array
  2022-06-12 21:21       ` Roman Mamedov
@ 2022-06-13 17:46         ` Marc MERLIN
  2022-06-13 18:06           ` Roman Mamedov
  2022-06-13 18:10           ` Zygo Blaxell
  2022-06-13 18:13         ` Marc MERLIN
  1 sibling, 2 replies; 24+ messages in thread
From: Marc MERLIN @ 2022-06-13 17:46 UTC (permalink / raw)
  To: Roman Mamedov
  Cc: Andrea Gelmini, Andrei Borzenkov, Zygo Blaxell, Josef Bacik,
	Chris Murphy, Qu Wenruo, linux-btrfs

On Mon, Jun 13, 2022 at 02:21:07AM +0500, Roman Mamedov wrote:
> On Sat, 11 Jun 2022 07:52:59 -0700
> Marc MERLIN <marc@merlins.org> wrote:
> 
> > On Sat, Jun 11, 2022 at 02:30:33PM +0500, Roman Mamedov wrote:
> > > > 2) echo 0fb96f02-d8da-45ce-aba7-070a1a8420e3 >  /sys/block/bcache64/bcache/attach 
> > > >    gargamel:/dev# cat /sys/block/md7/bcache/cache_mode
> > > >    [writethrough] writeback writearound none
> > > 
> > > Maybe try LVM Cache this time?
> >  
> > Hard to know either way, trading one layer for another, and LVM has
> > always seemed heavier
> 
> I'd suggest to put the LUKS volume onto an LV still (in case you don't), so you
> can add and remove cache just to see how it works; unlike with bcache, an LVM
> cache can be added to an existing LV and then removed without a trace, all
> without having to displace 44 TB of data for that.

Thanks. I've always felt that LVM was heavyweight and required extra
steps and tools, so I've been avoiding it, but maybe that wasn't
rational.
bcache, by the way, you can set up without a cache device and then use
it normally without the caching layer. I think it's actually pretty
similar, but you have to set it up beforehand (just like LVM).
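
(Concretely - if memory serves - that's just "make-bcache -B /dev/mdX"
on the backing device; the cache set gets attached later via the sysfs
attach step from my plan above, or never.)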

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
 
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08


* Re: Suggestions for building new 44TB Raid5 array
  2022-06-13 17:46         ` Marc MERLIN
@ 2022-06-13 18:06           ` Roman Mamedov
  2022-06-14  4:51             ` Marc MERLIN
  2022-06-13 18:10           ` Zygo Blaxell
  1 sibling, 1 reply; 24+ messages in thread
From: Roman Mamedov @ 2022-06-13 18:06 UTC (permalink / raw)
  To: Marc MERLIN
  Cc: Andrea Gelmini, Andrei Borzenkov, Zygo Blaxell, Josef Bacik,
	Chris Murphy, Qu Wenruo, linux-btrfs

On Mon, 13 Jun 2022 10:46:40 -0700
Marc MERLIN <marc@merlins.org> wrote:

> bcache, by the way, you can set up without a cache device and then use
> it normally without the caching layer. I think it's actually pretty
> similar, but you have to set it up beforehand (just like LVM)

What I mean is bcache in this way stays bcache-without-a-cache forever, which
feels odd; it still goes through the bcache code, has the module loaded, keeps
the device name, etc.

Whereas in LVM caching is a completely optional side-feature, and many people
would just run LVM in any case, not even thinking about enabling cache. LVM is
basically "the next generation" of disk partitions, with way more features,
but not much more overhead.

-- 
With respect,
Roman


* Re: Suggestions for building new 44TB Raid5 array
  2022-06-13 17:46         ` Marc MERLIN
  2022-06-13 18:06           ` Roman Mamedov
@ 2022-06-13 18:10           ` Zygo Blaxell
  1 sibling, 0 replies; 24+ messages in thread
From: Zygo Blaxell @ 2022-06-13 18:10 UTC (permalink / raw)
  To: Marc MERLIN
  Cc: Roman Mamedov, Andrea Gelmini, Andrei Borzenkov, Josef Bacik,
	Chris Murphy, Qu Wenruo, linux-btrfs

On Mon, Jun 13, 2022 at 10:46:40AM -0700, Marc MERLIN wrote:
> On Mon, Jun 13, 2022 at 02:21:07AM +0500, Roman Mamedov wrote:
> > On Sat, 11 Jun 2022 07:52:59 -0700
> > Marc MERLIN <marc@merlins.org> wrote:
> > 
> > > On Sat, Jun 11, 2022 at 02:30:33PM +0500, Roman Mamedov wrote:
> > > > > 2) echo 0fb96f02-d8da-45ce-aba7-070a1a8420e3 >  /sys/block/bcache64/bcache/attach 
> > > > >    gargamel:/dev# cat /sys/block/md7/bcache/cache_mode
> > > > >    [writethrough] writeback writearound none
> > > > 
> > > > Maybe try LVM Cache this time?
> > >  
> > > Hard to know either way, trading one layer for another, and LVM has
> > > always seemed heavier
> > 
> > I'd suggest to put the LUKS volume onto an LV still (in case you don't), so you
> > can add and remove cache just to see how it works; unlike with bcache, an LVM
> > cache can be added to an existing LV and then removed without a trace, all
> > without having to displace 44 TB of data for that.
> 
> Thanks. I've always felt that LVM was heavyweight and required extra
> steps and tools, so I've been avoiding it, but maybe that wasn't
> rational.
> bcache, by the way, you can set up without a cache device and then use
> it normally without the caching layer. I think it's actually pretty
> similar, but you have to set it up beforehand (just like LVM)

You can trivially convert from lvmcache to plain LV on the fly.  It's a
pretty essential capability for long-term maintenance, since you can't
move or resize the LV while it's cached.

If you have an LV and you want it to be cached with bcache, you can hack
up the LVM configuration after the fact with https://github.com/g2p/blocks


> Marc
> -- 
> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
>  
> Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08
> 


* Re: Suggestions for building new 44TB Raid5 array
  2022-06-12 21:21       ` Roman Mamedov
  2022-06-13 17:46         ` Marc MERLIN
@ 2022-06-13 18:13         ` Marc MERLIN
  2022-06-13 18:29           ` Roman Mamedov
  2022-06-13 20:08           ` Zygo Blaxell
  1 sibling, 2 replies; 24+ messages in thread
From: Marc MERLIN @ 2022-06-13 18:13 UTC (permalink / raw)
  To: Roman Mamedov
  Cc: Andrea Gelmini, Andrei Borzenkov, Zygo Blaxell, Josef Bacik,
	Chris Murphy, Qu Wenruo, linux-btrfs

On Mon, Jun 13, 2022 at 02:21:07AM +0500, Roman Mamedov wrote:
> I'd suggest to put the LUKS volume onto an LV still (in case you don't), so you
> can add and remove cache just to see how it works; unlike with bcache, an LVM

In case I decide to give that a shot, what would the actual LVM
command(s) look like to create a null (cache-less) LVM? You'd just make
a single PV using the cryptsetup-decrypted version of the mdadm raid5
and then an LV that takes all of it, but after the fact you can modify
the LV and add a cache?

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
 
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08


* Re: Suggestions for building new 44TB Raid5 array
  2022-06-13 18:13         ` Marc MERLIN
@ 2022-06-13 18:29           ` Roman Mamedov
  2022-06-13 20:08           ` Zygo Blaxell
  1 sibling, 0 replies; 24+ messages in thread
From: Roman Mamedov @ 2022-06-13 18:29 UTC (permalink / raw)
  To: Marc MERLIN
  Cc: Andrea Gelmini, Andrei Borzenkov, Zygo Blaxell, Josef Bacik,
	Chris Murphy, Qu Wenruo, linux-btrfs

On Mon, 13 Jun 2022 11:13:22 -0700
Marc MERLIN <marc@merlins.org> wrote:

> On Mon, Jun 13, 2022 at 02:21:07AM +0500, Roman Mamedov wrote:
> > I'd suggest to put the LUKS volume onto an LV still (in case you don't), so you
> > can add and remove cache just to see how it works; unlike with bcache, an LVM
> 
> In case I decide to give that a shot, what would the actual LVM
> command(s) look like to create a null LVM? You'd just make a single PV
> using the cryptestup decrypted version of the mdadm raid5 

It is a question of whether you want to cache encrypted or plain-text data. I
guess the former should be preferable, for complete peace of mind against
data forensics on the cache device, but with a toll on performance, due to the
need to re-decrypt even the cache hits each time.

In case of caching encrypted, it's:

mdraid => PV => LV => LUKS
                |
             (cache)

Otherwise:

mdraid => LUKS => PV => LV
                        |
                     (cache)

For the actual commands see e.g.
https://tomlankhorst.nl/setup-lvm-raid-array-mdadm-linux#set-up-logical-volume-management-lvm

> an LV that takes all of it, but after the fact you can modify the LV and add
> a cache?

Yes.
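
For the second layout that would be roughly (a sketch with made-up VG/LV
names, on top of the /dev/mapper/dshelf1 device from your step 4):

	pvcreate /dev/mapper/dshelf1
	vgcreate vg_dshelf1 /dev/mapper/dshelf1
	lvcreate -n data -l 100%FREE vg_dshelf1
	mkfs.btrfs -m dup -L dshelf1 /dev/vg_dshelf1/data

The cache LV would then be carved out of an SSD PV added to the same VG
later on.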

-- 
With respect,
Roman



* Re: Suggestions for building new 44TB Raid5 array
  2022-06-13 18:13         ` Marc MERLIN
  2022-06-13 18:29           ` Roman Mamedov
@ 2022-06-13 20:08           ` Zygo Blaxell
  2022-06-14  6:36             ` Torbjörn Jansson
  1 sibling, 1 reply; 24+ messages in thread
From: Zygo Blaxell @ 2022-06-13 20:08 UTC (permalink / raw)
  To: Marc MERLIN
  Cc: Roman Mamedov, Andrea Gelmini, Andrei Borzenkov, Josef Bacik,
	Chris Murphy, Qu Wenruo, linux-btrfs

On Mon, Jun 13, 2022 at 11:13:22AM -0700, Marc MERLIN wrote:
> On Mon, Jun 13, 2022 at 02:21:07AM +0500, Roman Mamedov wrote:
> > I'd suggest to put the LUKS volume onto an LV still (in case you don't), so you
> > can add and remove cache just to see how it works; unlike with bcache, an LVM
> 
> In case I decide to give that a shot, what would the actual LVM
> command(s) look like to create a null LVM? You'd just make a single PV
> using the cryptsetup-decrypted version of the mdadm raid5 and then an LV
> that takes all of it, but after the fact you can modify the LV and add a
> cache?

Some variables:

	vg=name of VG...
	device=name of cache device (SSD) PV...
	base=name of existing backing (HDD) LV...
	meta=meta$base
	pool=pool$base

Add a cache LV to an existing LV with:

	lvcreate $vg -n $meta -L 1G $device
	lvcreate $vg -n $pool -l 90%PVS $device
	lvconvert -f --type cache-pool --poolmetadata $vg/$meta $vg/$pool
	lvconvert -f --type cache --cachepool $vg/$pool $vg/$base --cachemode writethrough

Uncache with:

	lvconvert -f --uncache $vg/$base

Note that 'lvconvert' will flush the entire cache back to the backing
store during uncache at minimum IO priority, so it will take some time
and can be prolonged indefinitely by a continuous IO workload on top.
Also, the uncache operation will propagate any corruption in the SSD
cache back to the HDD LV, even in writethrough mode.

> Marc
> -- 
> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
>  
> Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08
> 


* Re: Suggestions for building new 44TB Raid5 array
  2022-06-13 18:06           ` Roman Mamedov
@ 2022-06-14  4:51             ` Marc MERLIN
  0 siblings, 0 replies; 24+ messages in thread
From: Marc MERLIN @ 2022-06-14  4:51 UTC (permalink / raw)
  To: Roman Mamedov, Zygo Blaxell
  Cc: Andrea Gelmini, Andrei Borzenkov, Josef Bacik, Chris Murphy,
	Qu Wenruo, linux-btrfs

Thanks to you both for your kind help. If I'm rebuilding everything,
might as well future-proof it as well as possible.

On Mon, Jun 13, 2022 at 11:06:25PM +0500, Roman Mamedov wrote:
> What I mean is bcache in this way stays bcache-without-a-cache forever, which
> feels odd; it still goes through the bcache code, has the module loaded, keeps
> the device name, etc;
 
Fair point. I have done that, but I see what you're saying.

> Whereas in LVM caching is a completely optional side-feature, and many people
> would just run LVM in any case, not even thinking about enabling cache. LVM is
> basically "the next generation" of disk partitions, with way more features,
> but not much more overhead.

Fair enough. I have used LVM for many years, since the now-defunct lvm1,
and I've run into a fair number of issues, some reliability, some
performance. That was many, many years ago though, so I'll take your word
for it that it's a lot more lightweight and safe now.

Actually I think I stopped using LVM the same time I started using
btrfs, because effectively btrfs subvolumes were close enough to LVM LVs
for my use, but yes I understand that different LVs are actually
different filesystems and you can do extra stuff like caching.

Actually I have another array where there were so many files and
snapshots that I split it into different LVs with dm-thin so that I
didn't stress the btrfs code too much (which I'm told gets unhappy when
you have hundreds of snapshots).

On Mon, Jun 13, 2022 at 02:10:56PM -0400, Zygo Blaxell wrote:
> You can trivially convert from lvmcache to plain LV on the fly.  It's a
> pretty essential capability for long-term maintenance, since you can't
> move or resize the LV while it's cached.
> 
> If you have a LV and you want it to be cached with bcache, you can hack
> up the LVM configuration after the fact with https://github.com/g2p/blocks
 
Got it, thanks much.

On Mon, Jun 13, 2022 at 11:29:07PM +0500, Roman Mamedov wrote:
> It is a question of whether you want to cache encrypted, or plain-text data. I
> guess the former should be preferable, for a complete peace-of-mind against
> data forensics vs the cache device, but with a toll on performance, due to the
> need to re-decrypt even the cache hits each time.
 
Right, I know that tradeoff. Also, LUKS makes things a bit more complicated
if you want to grow the FS.

> In case of caching encrypted, it's:
> 
> mdraid => PV => LV => LUKS
>                 |
>              (cache)
> 
> Otherwise:
> 
> mdraid => LUKS => PV => LV
>                         |
>                      (cache)

Right. I'll probably do that.

On Mon, Jun 13, 2022 at 04:08:07PM -0400, Zygo Blaxell wrote:
> Add a cache LV to an existing LV with:
> 
> 	lvcreate $vg -n $meta -L 1G $device
> 	lvcreate $vg -n $pool -l 90%PVS $device
> 	lvconvert -f --type cache-pool --poolmetadata $vg/$meta $vg/$pool
> 	lvconvert -f --type cache --cachepool $vg/$pool $vg/$base --cachemode writethrough
> 
> Uncache with:
> 
> 	lvconvert -f --uncache $vg/$base
> 
> Note that 'lvconvert' will flush the entire cache back to the backing
> store during uncache at minimum IO priority, so it will take some time
> and can be prolonged indefinitely by a continuous IO workload on top.
> Also, the uncache operation will propagate any corruption in the SSD
> cache back to the HDD LV, even in writethrough mode.

Thanks much for the heads up.

Best,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
 
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08


* Re: Suggestions for building new 44TB Raid5 array
  2022-06-13 20:08           ` Zygo Blaxell
@ 2022-06-14  6:36             ` Torbjörn Jansson
  0 siblings, 0 replies; 24+ messages in thread
From: Torbjörn Jansson @ 2022-06-14  6:36 UTC (permalink / raw)
  To: Zygo Blaxell, Marc MERLIN
  Cc: Roman Mamedov, Andrea Gelmini, Andrei Borzenkov, Josef Bacik,
	Chris Murphy, Qu Wenruo, linux-btrfs

On 2022-06-13 22:08, Zygo Blaxell wrote:
> On Mon, Jun 13, 2022 at 11:13:22AM -0700, Marc MERLIN wrote:
>> On Mon, Jun 13, 2022 at 02:21:07AM +0500, Roman Mamedov wrote:
>>> I'd suggest to put the LUKS volume onto an LV still (in case you don't), so you
>>> can add and remove cache just to see how it works; unlike with bcache, an LVM
>>
>> In case I decide to give that a shot, what would the actual LVM
>> command(s) look like to create a null LVM? You'd just make a single PV
>> using the cryptsetup-decrypted version of the mdadm raid5 and then an LV
>> that takes all of it, but after the fact you can modify the LV and add a
>> cache?
> 
> Some variables:
> 
> 	vg=name of VG...
> 	device=name of cache device (SSD) PV...
> 	base=name of existing backing (HDD) LV...
> 	meta=meta$base
> 	pool=pool$base
> 
> Add a cache LV to an existing LV with:
> 
> 	lvcreate $vg -n $meta -L 1G $device
> 	lvcreate $vg -n $pool -l 90%PVS $device
> 	lvconvert -f --type cache-pool --poolmetadata $vg/$meta $vg/$pool
> 	lvconvert -f --type cache --cachepool $vg/$pool $vg/$base --cachemode writethrough
> 
> Uncache with:
> 
> 	lvconvert -f --uncache $vg/$base
> 
> Note that 'lvconvert' will flush the entire cache back to the backing
> store during uncache at minimum IO priority, so it will take some time
> and can be prolonged indefinitely by a continuous IO workload on top.
> Also, the uncache operation will propagate any corruption in the SSD
> cache back to the HDD LV, even in writethrough mode.
> 

Personally, when I set up lvmcache I always use the "all in one" command
to create it.
And if I forget the syntax, because the man pages are a bit unclear on how
to do it exactly, then I go to https://wiki.archlinux.org/title/LVM#Cache
for a refresher.

It is something like:

	lvcreate --type cache --cachemode writethrough -l 100%FREE \
		-n root_cachepool MyVolGroup/rootvol /dev/fastdisk

I usually change -l to -L and specify the size of the cache there, and
the name (-n) is not too important since the LV will still be named just
the same as before the cache was enabled; so this name is "just"
something that shows up in, for example, lvs output.

Only type=cache allows you to create and remove the cache with a live
filesystem; the other type, writecache, requires the filesystem and
probably also the LV to be deactivated.
I don't remember exactly, but it was more effort to turn writecache
on/off vs the normal cache.


* Re: Suggestions for building new 44TB Raid5 array
  2022-06-11  4:51 Suggestions for building new 44TB Raid5 array Marc MERLIN
  2022-06-11  9:30 ` Roman Mamedov
  2022-06-11 23:44 ` Zygo Blaxell
@ 2022-06-14 11:03 ` ronnie sahlberg
       [not found] ` <5e1733e6-471e-e7cb-9588-3280e659bfc2@aqueos.com>
  3 siblings, 0 replies; 24+ messages in thread
From: ronnie sahlberg @ 2022-06-14 11:03 UTC (permalink / raw)
  To: Marc MERLIN
  Cc: Andrei Borzenkov, Zygo Blaxell, Josef Bacik, linux-btrfs,
	Chris Murphy, Qu Wenruo

On Sat, 11 Jun 2022 at 17:16, Marc MERLIN <marc@merlins.org> wrote:
>
> so, my apologies to all for the thread of death that is hopefully going
> to be over soon. I still want to help Josef fix the tools though,
> hopefully we'll get that filesystem back to a mountable state.
>
> That said, it's been over 2 months now, and I do need to get this
> filesystem back up from backup, so I ended up buying new drives (5x
> 11TiB in raid5).
>
> Given the pretty massive corruption that happened in ways that I still
> can't explain, I'll make sure to turn off all the drive write caches
> but I think I'm not sure I want to trust bcache anymore even though
> I had it in writethrough mode.
>
> Here's the Email from March, questions still apply:
>
> Kernel will be 5.16. Filesystem will be 24TB and contain mostly bigger
> files (100MB to 10GB).
>
> 1) mdadm --create /dev/md7 --level=5 --consistency-policy=ppl --raid-devices=5 /dev/sd[abdef]1 --chunk=256 --bitmap=internal
> 2) echo 0fb96f02-d8da-45ce-aba7-070a1a8420e3 >  /sys/block/bcache64/bcache/attach
>    gargamel:/dev# cat /sys/block/md7/bcache/cache_mode
>    [writethrough] writeback writearound none
> 3) cryptsetup luksFormat --align-payload=2048 -s 256 -c aes-xts-plain64  /dev/bcache64
> 4) cryptsetup luksOpen /dev/bcache64 dshelf1
> 5) mkfs.btrfs -m dup -L dshelf1 /dev/mapper/dshelf1
>
> Any other btrfs options I should set for format to improve reliability
> first and performance second?
> I'm told I should use space_cache=v2, is it default now with btrfs-progs 5.10.1-2 ?
>
> As for bcache, I'm really thinking about dropping it, unless I'm told
> it should be safe to use.
>
> Thanks,
> Marc

My needs are much more basic.
I have a LOT of large files. ISO images and QEMU disk images. I also
have hundreds of thousands of photos.

I used different multi-disk solutions but found them all too fragile,
so for the last 8 years I have used a setup that is basically 5 disks,
each with their own EXT4 filesystem on top of LUKS, and two additional
drives to have 2-disk parity in snapraid.
Now, snapraid does not do in-line raid updates, so I carefully manage
how I handle the data.
Audio, photos and QEMU base images are immutable so these files are
not a problem.
For VM images I have, for each machine, an immutable 'base' image that
snapraid takes care of, and I have live images based on that which are
not handled by snapraid.  (qemu-img create -b base.img cow.img)
If a live image goes corrupt due to a power outage or similar, I just
re-create it on top of the latest archived base image.
As I often do stuff to the VM images that cause kernel panics, this is
a very convenient way to restore them quickly and with little effort
to a known good state.
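
For what it's worth, recent qemu-img wants the backing format spelled
out explicitly, so re-creating a live overlay looks roughly like:

	qemu-img create -f qcow2 -F qcow2 -b base.img cow.img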

If one disk has a catastrophic failure, I only lose the files on that
particular disk and just have to restore them, but nothing else.
Now I do export them as 5 different filesystems/shares, but that is
just because I am too lazy to set up some kind of "merge fs".

If your use case is mostly-read and mostly-archive, this might work for
you too, and it is VERY reliable.
Peace of mind knowing that if a single disk dies I do not have a total
data-loss scenario. If the Windows VM disk dies, I just restore those
images while all the other disks are still online.


It is simple, primitive, 1980s-type technology, but it works. And it
is reliable.

> --
> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
>
> Home page: http://marc.merlins.org/


* Re: Suggestions for building new 44TB Raid5 array
       [not found] ` <5e1733e6-471e-e7cb-9588-3280e659bfc2@aqueos.com>
@ 2022-06-20 15:01   ` Marc MERLIN
  2022-06-20 15:52     ` Ghislain Adnet
  2022-06-20 17:02     ` Andrei Borzenkov
  0 siblings, 2 replies; 24+ messages in thread
From: Marc MERLIN @ 2022-06-20 15:01 UTC (permalink / raw)
  To: Ghislain Adnet; +Cc: linux-btrfs

>  I have a stupid question to ask: Why use btrfs here? Is not mdadm+xfs good enough?
 
I use btrfs for historical snapshots and btrfs send/receive remote
backups.
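
A typical cycle looks roughly like this (hypothetical paths and host,
just to illustrate):

	btrfs subvolume snapshot -r /mnt/dshelf1/data /mnt/dshelf1/data_$(date +%F)
	btrfs send -p /mnt/dshelf1/data_<previous> /mnt/dshelf1/data_$(date +%F) \
		| ssh backuphost btrfs receive /backup/dshelf1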

>  If you want snapshots, why not use ZFS then? I tried to use btrfs myself and met a lot of issues with it that I did not have with mdadm+ext4. Perhaps btrfs is not suited to that use (here raid5)?

ZFS is not GPL-compatible and is out of tree.

>  ZFS has encryption, a raid5-like array, snapshots and an L2ARC cache already, so no need to add 4 layers on it. It seems a solution for you.

It has a few of its own issues, but yes, if it were actually GPL
compatible and in the linux kernel source tree, I'd consider it.

It's also owned by a company (Oracle) that has tried to sue others for
billions of dollars over software patents, or even an algorithm, i.e.
not a company I'm willing to trust by any means.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
 
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08


* Re: Suggestions for building new 44TB Raid5 array
  2022-06-20 15:01   ` Marc MERLIN
@ 2022-06-20 15:52     ` Ghislain Adnet
  2022-06-20 16:27       ` Marc MERLIN
  2022-06-20 17:02     ` Andrei Borzenkov
  1 sibling, 1 reply; 24+ messages in thread
From: Ghislain Adnet @ 2022-06-20 15:52 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-btrfs


> 
> It has a few of its own issues, but yes, if it were actually GPL
> compatible and in the linux kernel source tree, I'd consider it.
> 
> It's also owned by a company (Oracle) that has tried to sue others for
> billions of dollars over software patents, or even an algorithm, i.e.
> not a company I'm willing to trust by any means.
> 

Well, I completely understand; I use btrfs for the same reason, but it
seems that on your side this use case is a little far from the features
provided. The more layers I use, the more I fear a Leaning Tower of
Pisa syndrome :)

Good luck with the setup!

-- 
cordialement,
Ghislain



* Re: Suggestions for building new 44TB Raid5 array
  2022-06-20 15:52     ` Ghislain Adnet
@ 2022-06-20 16:27       ` Marc MERLIN
  0 siblings, 0 replies; 24+ messages in thread
From: Marc MERLIN @ 2022-06-20 16:27 UTC (permalink / raw)
  To: Ghislain Adnet; +Cc: linux-btrfs

On Mon, Jun 20, 2022 at 05:52:14PM +0200, Ghislain Adnet wrote:
> Well, I completely understand; I use btrfs for the same reason, but
> it seems that on your side this use case is a little far from the
> features provided.
> The more layers I use, the more I fear a Leaning Tower of Pisa syndrome :)

I share that worry, but using ZFS simply isn't an option for me, for
the reasons explained.
But indeed, I removed bcache as a layer.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
 
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08


* Re: Suggestions for building new 44TB Raid5 array
  2022-06-20 15:01   ` Marc MERLIN
  2022-06-20 15:52     ` Ghislain Adnet
@ 2022-06-20 17:02     ` Andrei Borzenkov
  2022-06-20 17:26       ` Marc MERLIN
  1 sibling, 1 reply; 24+ messages in thread
From: Andrei Borzenkov @ 2022-06-20 17:02 UTC (permalink / raw)
  To: Marc MERLIN, Ghislain Adnet; +Cc: linux-btrfs

On 20.06.2022 18:01, Marc MERLIN wrote:
>>  I have a stupid question to ask : Why use btrfs here ? Is not mdamd+xfs good enough ?
>  
> I use btrfs for historical snapshots and btrfs send/receive remote
> backups.
> 
>>  If you want snapshot why not use ZFS then ? i try to use btrfs myself and meet a lot of issues with it that i did not had with mdadm+ext4. Perhaps btrfs is not suited to that use (here raid5) ?
> 
> ZFS is not GPL compatible and out of tree.
> 
>>  ZFS has crypt, raid5 like array and snapshot and LARC cache allready so no need to add 4 layer on it. It seems a solution for you.
> 
> It has a few of its own issues, but yes, if it were actually GPL
> compatible and in the linux kernel source tree, I'd consider it.
> 
> It's also owned by a company (Oracle)

ZFS on Linux is not owned by Oracle, to the best of my knowledge.

https://openzfs.github.io/openzfs-docs/License.html

> that has tried to sue others for
> billions of dollars over software patents, or even an algorithm, i.e.
> not a company I'm willing to trust by any means.
> 
> Marc



* Re: Suggestions for building new 44TB Raid5 array
  2022-06-20 17:02     ` Andrei Borzenkov
@ 2022-06-20 17:26       ` Marc MERLIN
  0 siblings, 0 replies; 24+ messages in thread
From: Marc MERLIN @ 2022-06-20 17:26 UTC (permalink / raw)
  To: Andrei Borzenkov; +Cc: Ghislain Adnet, linux-btrfs

On Mon, Jun 20, 2022 at 08:02:59PM +0300, Andrei Borzenkov wrote:
> ZFS on Linux is not owned by Oracle to my best knowledge.
> 
> https://openzfs.github.io/openzfs-docs/License.html
 
Oracle bought Sun and its patent portfolio in the process, including all
claims to any patents in ZFS. I simply will never trust them given what
they've already done.
I did give a full talk about this issue years ago:
https://marc.merlins.org/linux/talks/Btrfs-LC2014-JP/Btrfs.pdf
(go to page #5) and
https://www.theregister.com/2010/09/09/oracle_netapp_zfs_dismiss/
Basically there likely are Netapp patents in ZFS too, but I'm less
worried about Netapp suing others over patents, and they did settle
with Sun back in the day.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
 
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08


* Re: Suggestions for building new 44TB Raid5 array
  2022-06-11 14:52     ` Marc MERLIN
  2022-06-11 17:54       ` Roman Mamedov
  2022-06-12 21:21       ` Roman Mamedov
@ 2022-06-20 20:37       ` Andrea Gelmini
  2022-06-21  5:26         ` Zygo Blaxell
  2 siblings, 1 reply; 24+ messages in thread
From: Andrea Gelmini @ 2022-06-20 20:37 UTC (permalink / raw)
  To: Marc MERLIN
  Cc: Roman Mamedov, Andrei Borzenkov, Zygo Blaxell, Josef Bacik,
	Chris Murphy, Qu Wenruo, linux-btrfs

Il giorno sab 11 giu 2022 alle ore 16:53 Marc MERLIN
<marc@merlins.org> ha scritto:
> Mmmh, bcachefs, I was not aware of this new one. Not sure if I want to
> add yet another layer, especially if it's new.

I shared the link just to say: the bcache author does great work.
Bcachefs could be a way to replace a few layers, not to add a new one.

Just for the record:
I'm using a 120TB array with BTRFS and bcache as caching over a raid1
of 2TB SSDs.
I tried the same setup with LVM, but - sadly - the lvm tool complained
about a maximum cache size (16GB max, if I remember correctly; anyway,
no way to use the full 2TB).
Sad, because nowhere do they mention this.

Played a little bit with kernel source, but eventually didn't want to
risk too much with a server I want to use in production at work.

On my side, in the end, I really like cryptsetup for each HD, with
mergerfs and snapraid (for my home setup).
Very handy for replacing, playing, experimenting and so on.
Each time I tried one big single-volume setup, I eventually regretted it.

Ciao,
Gelma


* Re: Suggestions for building new 44TB Raid5 array
  2022-06-20 20:37       ` Andrea Gelmini
@ 2022-06-21  5:26         ` Zygo Blaxell
  2022-07-06  9:09           ` Andrea Gelmini
  0 siblings, 1 reply; 24+ messages in thread
From: Zygo Blaxell @ 2022-06-21  5:26 UTC (permalink / raw)
  To: Andrea Gelmini
  Cc: Marc MERLIN, Roman Mamedov, Andrei Borzenkov, Josef Bacik,
	Chris Murphy, Qu Wenruo, linux-btrfs

On Mon, Jun 20, 2022 at 10:37:25PM +0200, Andrea Gelmini wrote:
> Il giorno sab 11 giu 2022 alle ore 16:53 Marc MERLIN
> <marc@merlins.org> ha scritto:
> > Mmmh, bcachefs, I was not aware of this new one. Not sure if I want to
> > add yet another layer, especially if it's new.
>
> I share the link just to say: bcache author works in a great way.
> Bcachefs could be the idea to replace a few layers. Not to add new
> one.
>
> Just for the record.
> I'm using a 120TB array with BTRFS and bcache as caching over raid1 2TB ssd.
> I tried same setup with LVM, but - sadly - lvm tool complained about
> maximum cache size (16GB max, if I remember exactly, anyway no way to
> use the fully 2TB).
> Sad, because nowhere they mentioned this.

How many years ago was this?  There's no such limit today.  Here is a
500TB LV with 4TB of cache:

	# lvs -o +cachemode
	  LV          VG         Attr       LSize   Pool                Origin        Data%  Meta%  Move Log Cpy%Sync Convert CacheMode
	  lvol0       tv         Cwi-aoC--- 500.00t [lvol0_cache0_cvol] [lvol0_corig] 0.19   27.78           34.14            writeback
	  lvol0_cache tv         -wi-a-----   4.00t

There is a limit on metadata size, but you can override it in lvm.conf.
Presumably if you have a 100TB+ filesystem, you also have enough RAM lying
around to make the metadata size larger (it's 2.8GB at chunk size 128,
but that's not unreasonable for 4096GB of cache).

> Played a little bit with kernel source, but eventually didn't want to
> risk too much with a server I want to use in production at work.
>
> On my side, in the end, I really like cryptsetup for each HD, with
> mergerfs and snapraid (for my home setup).
> Very handy with replacing, playing, experimenting and so on.
> Each time I tried one big single volume setup, eventually I regret it.
>
> Ciao,
> Gelma


* Re: Suggestions for building new 44TB Raid5 array
  2022-06-21  5:26         ` Zygo Blaxell
@ 2022-07-06  9:09           ` Andrea Gelmini
  0 siblings, 0 replies; 24+ messages in thread
From: Andrea Gelmini @ 2022-07-06  9:09 UTC (permalink / raw)
  To: Zygo Blaxell
  Cc: Marc MERLIN, Roman Mamedov, Andrei Borzenkov, Josef Bacik,
	Chris Murphy, Qu Wenruo, linux-btrfs

Il giorno mar 21 giu 2022 alle ore 07:26 Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> ha scritto:
> How many years ago was this?  There's no such limit today.  Here is a
> 500TB LV with 4TB of cache:
Good to know, thanks!

> There is a limit on metadata size, but you can override it in lvm.conf.
> Presumably if you have a 100TB+ filesystem, you also have enough RAM lying
> around to make the metadata size larger (it's 2.8GB at chunk size 128,
> but that's not unreasonable for 4096GB of cache).

Yeah, we have at least 48GB of RAM on each server.
Well, I use it all to run beesd at night (tweaked the sources).
Fun fact, I found this project:
https://github.com/pkolaczk/fclones

It's a recent(!) file-based deduplicator in Rust, with parallelism at
every stage, hard/soft links and range clones.
Perfect for my needs, and blazing fast.

Ciao,
Gelma

