From: Goffredo Baroncelli <kreijack@libero.it>
To: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Cc: linux-btrfs@vger.kernel.org, Michael <mclaud@roznica.com.ua>,
Hugo Mills <hugo@carfax.org.uk>,
Martin Svec <martin.svec@zoner.cz>,
Wang Yugui <wangyugui@e16-tech.com>,
Paul Jones <paul@pauljones.id.au>,
Adam Borowski <kilobyte@angband.pl>
Subject: BTRFS and *CACHE setup [was Re: [RFC][PATCH V4] btrfs: preferred_metadata: preferred device for metadata]
Date: Fri, 8 Jan 2021 18:43:52 +0100 [thread overview]
Message-ID: <0dbec46b-8f46-afca-c61a-51b85300b0f2@libero.it> (raw)
In-Reply-To: <bc7d874f-3f8b-7eff-6d18-f9613e7c6972@libero.it>
On 1/8/21 6:30 PM, Goffredo Baroncelli wrote:
> On 1/8/21 2:05 AM, Zygo Blaxell wrote:
>> On Thu, May 28, 2020 at 08:34:47PM +0200, Goffredo Baroncelli wrote:
>>>
> [...]
>>
>> I've been testing these patches for a while now. They enable an
>> interesting use case that can't otherwise be done safely, sanely or
>> cheaply with btrfs.
>
> Thanks Zygo for this feedback. As usual you are a source of very interesting considerations.
>>
>> Normally if we have an array of, say, 10 spinning disks, and we want to
>> implement a writeback cache layer with SSD, we would need 10 distinct SSD
>> devices to avoid reducing btrfs's ability to recover from drive failures.
>> The writeback cache will be modified on both reads and writes, data and
>> metadata, so we need high endurance SSDs if we want them to make it to
>> the end of their warranty. The SSD firmware has to not have crippling
>> performance bugs while under heavy write load, which means we are now
>> restricted to an expensive subset of high endurance SSDs targeted at
>> the enterprise/NAS/video production markets...and we need 10 of them!
>>
>> NVME has fairly draconian restrictions on drive count, and getting
>> anything close to 10 of them into a btrfs filesystem can be an expensive
>> challenge. (I'm not counting solutions that use USB-to-NVME bridges
>> because those don't count as "sane" or "safe").
>>
>> We can share the cache between disks, but not safely in writeback mode,
>> because a failure in one SSD could affect multiple logical btrfs disks.
>> Strictly speaking we can't do it safely in any cache mode, but at least
>> with a writethrough cache we can recover the btrfs by throwing the SSDs
>> away.
[...]
Hi Zygo,
could you elaborate on the last sentence? What I understood is that in
writethrough mode the ordering (and the barriers) are preserved, so
this mode should be safe (bugs aside).
If this is true, it would be possible to have a btrfs setup with
multiple spinning disks and only one SSD acting as a cache. Of course,
it would work only in writethrough mode, and the main benefit would be
caching data for subsequent reads.
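For what it's worth, such a single-shared-SSD writethrough layout could be sketched with bcache roughly as follows. This is only an illustration of the idea, not a recommendation from this thread; the device names and the cache-set UUID are placeholders you would substitute for your own system:

```shell
# Hypothetical sketch: one SSD (/dev/nvme0n1) caching two spinning
# disks (/dev/sda, /dev/sdb) in writethrough mode via bcache.
# Device names below are assumptions; adjust for your hardware.

# Format the backing devices and the shared cache device.
make-bcache -B /dev/sda /dev/sdb
make-bcache -C /dev/nvme0n1

# Attach each backing device to the cache set, using the cache-set
# UUID printed by 'make-bcache -C' (shown as a placeholder here).
echo <cset-uuid> > /sys/block/bcache0/bcache/attach
echo <cset-uuid> > /sys/block/bcache1/bcache/attach

# Writethrough is the bcache default, but state it explicitly:
echo writethrough > /sys/block/bcache0/bcache/cache_mode
echo writethrough > /sys/block/bcache1/bcache/cache_mode

# btrfs raid1 across the cached devices: redundancy lives on the
# spinning disks, while the single SSD only accelerates reads.
mkfs.btrfs -d raid1 -m raid1 /dev/bcache0 /dev/bcache1
```

In writethrough mode every write is committed to the backing disk before completion is reported, so the SSD holds no data that exists nowhere else; only reads are accelerated.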
Does anyone have further experience with this? Has anyone tried to
recover a BTRFS filesystem after the cache disk died?
Oh... wait... Now I understand: if the cache disk returns bad data on
a read (without reporting an error), that bad data may be written back
to the other disks. In this case a single failure (of the cache disk)
can affect all the other disks, and the redundancy is lost...
BR
G.Baroncelli
Thread overview: 18+ messages
2020-05-28 18:34 [RFC][PATCH V4] btrfs: preferred_metadata: preferred device for metadata Goffredo Baroncelli
2020-05-28 18:34 ` [PATCH 1/4] Add an ioctl to set/retrive the device properties Goffredo Baroncelli
2020-05-28 22:03 ` Hans van Kranenburg
2020-05-28 18:34 ` [PATCH 2/4] Add flags for dedicated metadata disks Goffredo Baroncelli
2020-05-28 18:34 ` [PATCH 3/4] Export dev_item.type in sysfs /sys/fs/btrfs/<uuid>/devinfo/<devid>/type Goffredo Baroncelli
2020-05-28 18:34 ` [PATCH 4/4] btrfs: add preferred_metadata mode Goffredo Baroncelli
2020-05-28 22:02 ` Hans van Kranenburg
2020-05-29 16:26 ` Goffredo Baroncelli
2020-05-28 21:59 ` [RFC][PATCH V4] btrfs: preferred_metadata: preferred device for metadata Hans van Kranenburg
2020-05-29 16:37 ` Goffredo Baroncelli
2020-05-30 11:44 ` Zygo Blaxell
2020-05-30 11:51 ` Goffredo Baroncelli
2021-01-08 1:05 ` Zygo Blaxell
2021-01-08 17:30 ` Goffredo Baroncelli
2021-01-08 17:43 ` Goffredo Baroncelli [this message]
2021-01-09 21:23 ` Zygo Blaxell
2021-01-10 19:55 ` Goffredo Baroncelli
2021-01-16 0:25 ` Zygo Blaxell