Linux-BTRFS Archive on lore.kernel.org
* BTRFS subvolume RAID level
@ 2019-11-28 22:48 waxhead
  2019-11-29  0:44 ` Qu Wenruo
  2019-12-02 10:48 ` Anand Jain
  0 siblings, 2 replies; 7+ messages in thread
From: waxhead @ 2019-11-28 22:48 UTC (permalink / raw)
  To: Btrfs BTRFS

Just out of curiosity....

What are the (potential) showstoppers for implementing per-subvolume RAID 
levels in BTRFS? This is more and more interesting now that new RAID 
levels have been merged (RAID1c3/4) and RAID5/6 is slowly inching towards 
usable status.

I imagine that RAID1c4, for example, could potentially give a grotesque 
speed increase for parallel read operations once BTRFS learns to 
distribute reads to the devices with the shortest wait queues / the fastest devices.
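A toy sketch of that idea (plain userspace Python, not btrfs code; all names here are hypothetical illustrations): a read of a multi-copy block is routed to the mirror whose device currently has the shortest queue.

```python
# Toy model of "read from the device with the least waitqueue":
# given the queue depth of each device holding a copy of a block,
# serve the read from the least-loaded one. Not btrfs code.

def pick_mirror(queue_depths):
    """Index of the mirror device with the shortest I/O queue."""
    return min(range(len(queue_depths)), key=lambda i: queue_depths[i])

# Four copies (RAID1c4-style); device 2 is idle, so it serves the read.
print(pick_mirror([7, 3, 0, 12]))  # -> 2
```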

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: BTRFS subvolume RAID level
  2019-11-28 22:48 BTRFS subvolume RAID level waxhead
@ 2019-11-29  0:44 ` Qu Wenruo
  2019-12-02 10:48 ` Anand Jain
  1 sibling, 0 replies; 7+ messages in thread
From: Qu Wenruo @ 2019-11-29  0:44 UTC (permalink / raw)
  To: waxhead, Btrfs BTRFS

On 2019/11/29 6:48 AM, waxhead wrote:
> Just out of curiosity....
> 
> What are the (potential) showstoppers for implementing per-subvolume RAID
> levels in BTRFS? This is more and more interesting now that new RAID
> levels have been merged (RAID1c3/4) and RAID5/6 is slowly inching towards
> usable status.

My quick guesses are:
- A subvolume- and RAID-aware extent allocator
  The current extent allocator cares nothing about who is requesting the
  extent, and only a little about the RAID profile (only for convert).
  We need some work here at least.

- A way to prevent false ENOSPC due to profile restrictions
  If a subvolume uses some exotic profile while the extent/csum trees use
  a regular profile, and the subvolume eats too much space, it can become
  impossible to fit extents in their desired profile.
  We would then hit ENOSPC, and the filesystem could even go read-only.
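A toy illustration of that second point (userspace Python with made-up numbers, not btrfs code): a RAID1 chunk needs free space on two distinct devices, so space consumed by another profile can make RAID1 allocation fail even though plenty of free space remains in total.

```python
# Toy model of the false-ENOSPC scenario: RAID1 needs `size` units free
# on at least two *different* devices, so uneven consumption by another
# profile can starve it even with lots of total free space.

def can_alloc_raid1(free_per_device, size):
    return sum(1 for f in free_per_device if f >= size) >= 2

free = [100, 100, 100]
print(can_alloc_raid1(free, 32))  # -> True

free = [0, 0, 90]  # SINGLE-profile data exhausted two devices
print(can_alloc_raid1(free, 32))  # -> False: ENOSPC with 90 units free
```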

Thanks,
Qu

> 
> I imagine that RAID1c4, for example, could potentially give a grotesque
> speed increase for parallel read operations once BTRFS learns to
> distribute reads to the devices with the shortest wait queues / the fastest devices.




* Re: BTRFS subvolume RAID level
  2019-11-28 22:48 BTRFS subvolume RAID level waxhead
  2019-11-29  0:44 ` Qu Wenruo
@ 2019-12-02 10:48 ` Anand Jain
  2019-12-02 23:27   ` waxhead
  1 sibling, 1 reply; 7+ messages in thread
From: Anand Jain @ 2019-12-02 10:48 UTC (permalink / raw)
  To: waxhead, Btrfs BTRFS


> I imagine that RAID1c4, for example, could potentially give a grotesque 
> speed increase for parallel read operations once BTRFS learns to 
> distribute reads to the devices with the shortest wait queues / the fastest devices.

  That was exactly the objective of the readmirror patch on the ML.
  It proposed a framework for changing the readmirror policy as needed.

Thanks, Anand






* Re: BTRFS subvolume RAID level
  2019-12-02 10:48 ` Anand Jain
@ 2019-12-02 23:27   ` waxhead
  2019-12-03  1:30     ` Anand Jain
  0 siblings, 1 reply; 7+ messages in thread
From: waxhead @ 2019-12-02 23:27 UTC (permalink / raw)
  To: Anand Jain, Btrfs BTRFS



Anand Jain wrote:
> 
>> I imagine that RAID1c4, for example, could potentially give a grotesque 
>> speed increase for parallel read operations once BTRFS learns to 
>> distribute reads to the devices with the shortest wait queues / the 
>> fastest devices.
> 
>   That exactly was the objective of the Readmirror patch in the ML.
>   It proposed a framework to change the readmirror policy as needed.
> 
> Thanks, Anand

Indeed. If I remember correctly, your patch allowed for deterministic 
reading from certain devices. As just a regular btrfs user, the problem I 
see with this is that you lose a "potential free scrub" that *might* 
otherwise happen for often-read data. On the other hand, that is what 
manual scrubbing is for anyway.




* Re: BTRFS subvolume RAID level
  2019-12-02 23:27   ` waxhead
@ 2019-12-03  1:30     ` Anand Jain
  2019-12-03 20:31       ` waxhead
  0 siblings, 1 reply; 7+ messages in thread
From: Anand Jain @ 2019-12-03  1:30 UTC (permalink / raw)
  To: waxhead, Btrfs BTRFS



On 12/3/19 7:27 AM, waxhead wrote:
> 
> 
> Anand Jain wrote:
>>
>>> I imagine that RAID1c4, for example, could potentially give a grotesque 
>>> speed increase for parallel read operations once BTRFS learns to 
>>> distribute reads to the devices with the shortest wait queues / the 
>>> fastest devices.
>>
>>   That exactly was the objective of the Readmirror patch in the ML.
>>   It proposed a framework to change the readmirror policy as needed.
>>
>> Thanks, Anand
> 
> Indeed. If I remember correctly your patch allowed for deterministic 
> reading from certain devices.
  It provides a framework to configure the readmirror policy. The
  policies can be based on I/O depth, PID, or manual selection, e.g. for
  heterogeneous devices with different latencies.
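A toy sketch of such a framework (userspace Python; the policy names and context fields are invented for illustration, not the patch's actual interface): each policy is a function that picks a mirror, and the active one is selected by name.

```python
# Hypothetical readmirror policy framework: each policy maps the list of
# mirrors plus some runtime context to one mirror index. Not real btrfs
# code; the names and context fields are invented.

def by_queue_depth(mirrors, ctx):
    return min(mirrors, key=lambda m: ctx["queue_depth"][m])

def by_pid(mirrors, ctx):
    # Spread readers across mirrors by process id (btrfs's classic default).
    return mirrors[ctx["pid"] % len(mirrors)]

def manual(mirrors, ctx):
    # Administrator pins reads to one device, e.g. the SSD of a mixed pair.
    return ctx["preferred"]

POLICIES = {"queue-depth": by_queue_depth, "pid": by_pid, "manual": manual}

def read_mirror(policy, mirrors, ctx):
    return POLICIES[policy](mirrors, ctx)

ctx = {"queue_depth": {0: 4, 1: 0}, "pid": 1234, "preferred": 1}
print(read_mirror("queue-depth", [0, 1], ctx))  # -> 1 (shortest queue)
print(read_mirror("pid", [0, 1], ctx))          # -> 0 (1234 % 2 == 0)
```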

> As just a regular btrfs user the problem I 
> see with this is that you lose a "potential free scrub" that *might* 
> otherwise happen for often-read data. On the other hand that is what 
> manual scrubbing is for anyway.

  Ha ha.

  When it comes to data reliability and availability we need
  guarantees, and only a deterministic approach can provide them.

  What you are asking for is to route a particular read to the copy
  that was not read before, so as to avoid the need for scrubbing, or
  to make scrubbing intelligent enough to scrub only old, never-read
  blocks. That will be challenging: we would need to keep a history of
  each block and the device it was read from - most likely in scope for
  BPF-based external tools, but definitely not within the kernel.
  Within the kernel we can create a readmirror-like framework so that
  an external tool can achieve it.
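A toy model of that bookkeeping (userspace Python, purely illustrative; as noted above, in practice this would belong in BPF-based external tooling, not the kernel): record which copy each normal read verified, then scrub only the copies no read ever touched.

```python
# Track which mirror served each block's normal reads, then scrub only
# the never-read copies. Purely illustrative bookkeeping, not btrfs code.

read_history = {}  # block id -> set of mirror indices verified by reads

def record_read(block, mirror):
    read_history.setdefault(block, set()).add(mirror)

def mirrors_to_scrub(block, n_mirrors):
    """Copies of `block` that no normal read has checked yet."""
    return sorted(set(range(n_mirrors)) - read_history.get(block, set()))

record_read("blk-A", 0)              # hot block, always read from the SSD
print(mirrors_to_scrub("blk-A", 2))  # -> [1]: the HDD copy was never read
print(mirrors_to_scrub("blk-B", 2))  # -> [0, 1]: cold block, scrub both
```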


Thanks, Anand


* Re: BTRFS subvolume RAID level
  2019-12-03  1:30     ` Anand Jain
@ 2019-12-03 20:31       ` waxhead
  2019-12-05 12:10         ` Anand Jain
  0 siblings, 1 reply; 7+ messages in thread
From: waxhead @ 2019-12-03 20:31 UTC (permalink / raw)
  To: Anand Jain, Btrfs BTRFS



Anand Jain wrote:
> 
> 
> On 12/3/19 7:27 AM, waxhead wrote:
>>
>>
>> Anand Jain wrote:
>>>
>>>> I imagine that RAIDc4 for example could potentially give a grotesque 
>>>> speed increase for parallel read operations once BTRFS learns to 
>>>> distribute reads to the device with the least waitqueue / fastest 
>>>> devices.
>>>
>>>   That exactly was the objective of the Readmirror patch in the ML.
>>>   It proposed a framework to change the readmirror policy as needed.
>>>
>>> Thanks, Anand
>>
>> Indeed. If I remember correctly your patch allowed for deterministic 
>> reading from certain devices.
>   It provides a framework to configure the readmirror policy. The
>   policies can be based on I/O depth, PID, or manual selection, e.g.
>   for heterogeneous devices with different latencies.
> 
>> As just a regular btrfs user the problem I see with this is that you 
>> lose a "potential free scrub" that *might* otherwise happen for 
>> often-read data. On the other hand that is what manual scrubbing is 
>> for anyway.
> 
>   Ha ha.
> 
>   When it comes to data reliability and availability we need
>   guarantees, and only a deterministic approach can provide them.
> 
Uhm, what I meant was that if someone sets a readmirror policy to read 
from the fastest devices - for example in RAID1, where a copy exists on 
both a hard drive and an SSD, and reads are served from the fastest 
device (the SSD) - then you will never get an "accidental" read on the 
hard drive, which makes scrubbing absolutely necessary (which it 
actually is anyway).

In other words, for sloppy sysadmins: if data is read often, the 
hottest data is likely to have both copies read. If you set a policy 
that prefers to always read from SSDs, it could happen that the poor 
hard drive is never "checked".

>   What you are asking for is to route a particular read to the copy
>   that was not read before, so as to avoid the need for scrubbing, or
>   to make scrubbing intelligent enough to scrub only old, never-read
>   blocks. That will be challenging: we would need to keep a history of
>   each block and the device it was read from - most likely in scope
>   for BPF-based external tools, but definitely not within the kernel.
>   Within the kernel we can create a readmirror-like framework so that
>   an external tool can achieve it.

From what I remember of my previous post (I am too lazy to look it up), 
I was hoping that subvolumes could be assigned or "prioritized" to 
certain devices, e.g. device groups. That means you could put all 
SSDs of a certain speed in one group, all hard drives in another group, 
and NVMe storage devices in yet another group. Or you could put all 
devices on a certain JBOD controller board in their own group. That way 
BTRFS could prioritize storing certain subvolumes on a certain 
group and/or even allow migrating (balancing) to another group. If you 
run out of space you can always distribute across other groups and do 
magic things there ;)
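A toy sketch of the device-group idea (userspace Python; the group names, devices, and sizes are all made up): the allocator tries the subvolume's preferred group first and spills over to the other groups when it is full.

```python
# Hypothetical device groups: a subvolume prefers one group, and the
# allocator falls back to the remaining groups on shortage. Not btrfs code.

GROUPS = {"nvme": ["nvme0", "nvme1"], "ssd": ["sda", "sdb"], "hdd": ["sdc"]}
FREE = {"nvme0": 10, "nvme1": 10, "sda": 50, "sdb": 50, "sdc": 500}

def alloc(preferred_group, size):
    order = [preferred_group] + [g for g in GROUPS if g != preferred_group]
    for group in order:
        for dev in GROUPS[group]:
            if FREE[dev] >= size:
                FREE[dev] -= size
                return dev
    raise OSError("ENOSPC")

print(alloc("nvme", 8))   # -> nvme0: fits in the preferred group
print(alloc("nvme", 20))  # -> sda: preferred group full, spill to ssd
```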

Not that I have anything against the readmirror policy, but I think 
this approach would be much closer to ideal.

> 
> 
> Thanks, Anand


* Re: BTRFS subvolume RAID level
  2019-12-03 20:31       ` waxhead
@ 2019-12-05 12:10         ` Anand Jain
  0 siblings, 0 replies; 7+ messages in thread
From: Anand Jain @ 2019-12-05 12:10 UTC (permalink / raw)
  To: waxhead, Btrfs BTRFS



On 4/12/19 4:31 AM, waxhead wrote:
> 
> 
> Anand Jain wrote:
>>
>>
>> On 12/3/19 7:27 AM, waxhead wrote:
>>>
>>>
>>> Anand Jain wrote:
>>>>
>>>>> I imagine that RAID1c4, for example, could potentially give a 
>>>>> grotesque speed increase for parallel read operations once BTRFS 
>>>>> learns to distribute reads to the devices with the shortest wait 
>>>>> queues / the fastest devices.
>>>>
>>>>   That exactly was the objective of the Readmirror patch in the ML.
>>>>   It proposed a framework to change the readmirror policy as needed.
>>>>
>>>> Thanks, Anand
>>>
>>> Indeed. If I remember correctly your patch allowed for deterministic 
>>> reading from certain devices.
>>   It provides a framework to configure the readmirror policy. The
>>   policies can be based on I/O depth, PID, or manual selection, e.g.
>>   for heterogeneous devices with different latencies.
>>
>>> As just a regular btrfs user the problem I see with this is that you 
>>> lose a "potential free scrub" that *might* otherwise happen for 
>>> often-read data. On the other hand that is what manual scrubbing is 
>>> for anyway.
>>
>>   Ha ha.
>>
>>   When it comes to data reliability and availability we need
>>   guarantees, and only a deterministic approach can provide them.
>>
> Uhm, what I meant was that if someone sets a readmirror policy to read 
> from the fastest devices - for example in RAID1, where a copy exists on 
> both a hard drive and an SSD, and reads are served from the fastest 
> device (the SSD) - then you will never get an "accidental" read on the 
> hard drive, which makes scrubbing absolutely necessary (which it 
> actually is anyway).
> 
> In other words, for sloppy sysadmins: if data is read often, the 
> hottest data is likely to have both copies read. If you set a policy 
> that prefers to always read from SSDs, it could happen that the poor 
> hard drive is never "checked".
> 



>>   What you are asking for is to route a particular read to the copy
>>   that was not read before, so as to avoid the need for scrubbing, or
>>   to make scrubbing intelligent enough to scrub only old, never-read
>>   blocks. That will be challenging: we would need to keep a history of
>>   each block and the device it was read from - most likely in scope
>>   for BPF-based external tools, but definitely not within the kernel.
>>   Within the kernel we can create a readmirror-like framework so that
>>   an external tool can achieve it.
> 
> From what I remember of my previous post (I am too lazy to look it up), 
> I was hoping that subvolumes could be assigned or "prioritized" to 
> certain devices, e.g. device groups. That means you could put all 
> SSDs of a certain speed in one group, all hard drives in another group, 
> and NVMe storage devices in yet another group. Or you could put all 
> devices on a certain JBOD controller board in their own group. That way 
> BTRFS could prioritize storing certain subvolumes on a certain 
> group and/or even allow migrating (balancing) to another group. If you 
> run out of space you can always distribute across other groups and do 
> magic things there ;)
> 
> Not that I have anything against the readmirror policy, but I think 
> this approach would be much closer to ideal.

  Yep, I remember [1] you brought up subvolumes as a way to direct
  read I/O.

  [1]
  https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg86467.html

  The idea is indeed good. But it's not possible to implement as-is,
  because we share and link blocks across subvolumes and snapshots; it
  would either come with too many limitations or get messy.
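A tiny illustration of why the sharing makes this messy (userspace Python, hypothetical names): after a snapshot, one physical extent is referenced by two subvolumes, and if each subvolume were allowed its own RAID profile, that one extent would face two conflicting profile requirements.

```python
# One shared extent, two subvolumes, two desired RAID profiles: there is
# no single profile the extent can satisfy. Illustrative only.

extent_refs = {"ext-1": {"home"}}   # extent -> referencing subvolumes

def snapshot(src, dst):
    for subvols in extent_refs.values():   # snapshots share, not copy
        if src in subvols:
            subvols.add(dst)

subvol_profile = {"home": "raid1", "home-snap": "raid1c4"}
snapshot("home", "home-snap")

wanted = sorted(subvol_profile[s] for s in extent_refs["ext-1"])
print(wanted)  # -> ['raid1', 'raid1c4']: conflicting demands on one extent
```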

Thanks, Anand


>>
>>
>> Thanks, Anand


end of thread, back to index

Thread overview: 7+ messages
-- links below jump to the message on this page --
2019-11-28 22:48 BTRFS subvolume RAID level waxhead
2019-11-29  0:44 ` Qu Wenruo
2019-12-02 10:48 ` Anand Jain
2019-12-02 23:27   ` waxhead
2019-12-03  1:30     ` Anand Jain
2019-12-03 20:31       ` waxhead
2019-12-05 12:10         ` Anand Jain
