All of lore.kernel.org
* Btrfs - distribute files equally across multiple devices
@ 2015-07-06 16:22 Johannes Pfrang
  2015-07-06 16:45 ` Roman Mamedov
  2015-07-06 17:53 ` Hugo Mills
  0 siblings, 2 replies; 5+ messages in thread
From: Johannes Pfrang @ 2015-07-06 16:22 UTC (permalink / raw)
  To: linux-btrfs

Cross-posting my unix.stackexchange.com question[1] to the btrfs list
(slightly modified):

[1]
https://unix.stackexchange.com/questions/214009/btrfs-distribute-files-equally-across-multiple-devices

---------------------------------------------------------------------------------

I have a btrfs volume across two devices that has metadata RAID 1 and
data RAID 0. AFAIK, in the event one drive would fail, practically all
files above the 64KB default stripe size would be corrupted. As this
partition isn't performance critical, but should be space-efficient,
I've thought about re-balancing the filesystem to distribute files
equally across disks, but something like that doesn't seem to exist. The
ultimate goal would be to be able to still read some of the files in the
event of a drive failure.

AFAIK, using "single"/linear data allocation just fills up drives one by
one (at least that's what the wiki says).

Simple example (to the best of my knowledge):

Write two 128KB files (file0, file1) to two devices (dev0, dev1):

RAID0:

    file0/chunk0 (64KB): dev0
    file0/chunk1 (64KB): dev1
    file1/chunk0 (64KB): dev0
    file1/chunk1 (64KB): dev1

Linear:

    file0 (128KB): dev0
    file1 (128KB): dev0

Distribute files:

    file0 (128KB): dev0
    file1 (128KB): dev1

The simplest implementation would probably be something like: Always
write files to the disk with the least amount of space used. I think
this may be a valid software-raid use-case, as it combines RAID 0 (w/o
some of the performance gains[2]) with recoverability of about half of
the data/files (balanced by filled space or amount of files) in the
event of a drive-failure[3] by using filesystem information a
hardware-raid doesn't have. In the end this is more or less JBOD with
balanced disk usage + filesystem intelligence.
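The placement policy proposed above can be sketched as a toy simulation. All names, sizes, and the dict layout here are illustrative assumptions for the sketch, not btrfs internals:

```python
# Toy sketch of the proposed policy: place each whole file on the device
# with the least used space that can still hold it. Device names and the
# data layout are made up for illustration; this is not btrfs code.

def place_file(disks, size):
    """disks maps a device name to {'cap': capacity, 'used': used space}."""
    # Candidate devices that can hold the whole file.
    fits = [d for d in disks if disks[d]['cap'] - disks[d]['used'] >= size]
    if not fits:
        raise OSError("no single device can hold the file")
    # Pick the candidate with the least used space.
    target = min(fits, key=lambda d: disks[d]['used'])
    disks[target]['used'] += size
    return target

disks = {'dev0': {'cap': 4, 'used': 0}, 'dev1': {'cap': 4, 'used': 0}}
placements = [place_file(disks, 1) for _ in range(4)]
# With equal, empty devices the files alternate: dev0, dev1, dev0, dev1.
```

Losing one of two equally filled devices then leaves roughly half of the files readable, which is the recoverability argument made above.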

Is there something like that already in btrfs or could this be something
the btrfs-devs would consider?


[2] Multiple files can still be read from / written to different disks,
so performance is reduced only for single-file reads/writes
[3] with two disks; otherwise (totalDisks-failedDisks)/totalDisks


* Re: Btrfs - distribute files equally across multiple devices
  2015-07-06 16:22 Btrfs - distribute files equally across multiple devices Johannes Pfrang
@ 2015-07-06 16:45 ` Roman Mamedov
  2015-07-06 17:31   ` Johannes Pfrang
  2015-07-06 17:53 ` Hugo Mills
  1 sibling, 1 reply; 5+ messages in thread
From: Roman Mamedov @ 2015-07-06 16:45 UTC (permalink / raw)
  To: Johannes Pfrang; +Cc: linux-btrfs


On Mon, 6 Jul 2015 18:22:52 +0200
Johannes Pfrang <johannespfrang@gmail.com> wrote:

> The simplest implementation would probably be something like: Always
> write files to the disk with the least amount of space used. I think
> this may be a valid software-raid use-case, as it combines RAID 0 (w/o
> some of the performance gains[2]) with recoverability of about half of
> the data/files (balanced by filled space or amount of files) in the
> event of a drive-failure[3] by using filesystem information a
> hardware-raid doesn't have. In the end this is more or less JBOD with
> balanced disk usage + filesystem intelligence.

mhddfs does exactly that: https://romanrm.net/mhddfs

-- 
With respect,
Roman



* Re: Btrfs - distribute files equally across multiple devices
  2015-07-06 16:45 ` Roman Mamedov
@ 2015-07-06 17:31   ` Johannes Pfrang
  0 siblings, 0 replies; 5+ messages in thread
From: Johannes Pfrang @ 2015-07-06 17:31 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: linux-btrfs


That looks quite interesting!
Unfortunately this removes the ability to specify different RAID-levels
for metadata vs data and actually behaves more like btrfs "single" mode.
According to your link it fills drive by drive instead of distributing
files equally across them:

"When you create a new file in the virtual filesystem, mhddfs will
look at the free space, which remains on each of the drives. If the
first drive has enough free space, the file will be created on that
first drive."

What I propose (simplest implementation):

"When you create a new file in the filesystem, btrfs will look at the
used space on each of the drives. The file will be created on the drive
with the least used space that can hold the file."

Difference:

mhddfs only achieves maximum recoverability once the filesystem is full
(just like "single"), while my proposal achieves such recoverability
from the start.
(Here, maximum recoverability means (totalDisks-failedDisks)/totalDisks
as the fraction of recoverable data or files, depending on whether the
fs is balanced by used space or by file count.)
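The difference can be illustrated by simulating the two placement policies side by side. The capacities and equal-size files are made-up toy inputs; neither function models the real mhddfs or btrfs allocators:

```python
def fill_first(used, caps, size):
    # mhddfs-like: the first drive with enough free space wins.
    for i in range(len(caps)):
        if caps[i] - used[i] >= size:
            return i
    raise OSError("no drive has enough free space")

def least_used(used, caps, size):
    # Proposed: the drive with the least used space that can hold the file.
    fits = [i for i in range(len(caps)) if caps[i] - used[i] >= size]
    return min(fits, key=lambda i: used[i])

def simulate(policy, caps, nfiles, size=1):
    # Place nfiles files of equal size and record which drive got each one.
    used = [0] * len(caps)
    placements = []
    for _ in range(nfiles):
        i = policy(used, caps, size)
        used[i] += size
        placements.append(i)
    return placements

caps = [10, 10]
half_full_a = simulate(fill_first, caps, 10)   # everything on drive 0
half_full_b = simulate(least_used, caps, 10)   # alternates between drives
```

At the half-full point, losing drive 0 under fill-first loses all ten files, while under the least-used policy only five are lost: the from-the-start recoverability claimed above.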

Also, I'm not sure whether it's compatible with btrfs's special
remaining-space-calculation magic ^^

On 06.07.2015 18:45, Roman Mamedov wrote:
> On Mon, 6 Jul 2015 18:22:52 +0200
> Johannes Pfrang <johannespfrang@gmail.com> wrote:
>
>> The simplest implementation would probably be something like: Always
>> write files to the disk with the least amount of space used. I think
>> this may be a valid software-raid use-case, as it combines RAID 0 (w/o
>> some of the performance gains[2]) with recoverability of about half of
>> the data/files (balanced by filled space or amount of files) in the
>> event of a drive-failure[3] by using filesystem information a
>> hardware-raid doesn't have. In the end this is more or less JBOD with
>> balanced disk usage + filesystem intelligence.
> mhddfs does exactly that: https://romanrm.net/mhddfs
>





* Re: Btrfs - distribute files equally across multiple devices
  2015-07-06 16:22 Btrfs - distribute files equally across multiple devices Johannes Pfrang
  2015-07-06 16:45 ` Roman Mamedov
@ 2015-07-06 17:53 ` Hugo Mills
  2015-07-06 18:34   ` Johannes Pfrang
  1 sibling, 1 reply; 5+ messages in thread
From: Hugo Mills @ 2015-07-06 17:53 UTC (permalink / raw)
  To: Johannes Pfrang; +Cc: linux-btrfs


On Mon, Jul 06, 2015 at 06:22:52PM +0200, Johannes Pfrang wrote:
> Cross-posting my unix.stackexchange.com question[1] to the btrfs list
> (slightly modified):
> 
> [1]
> https://unix.stackexchange.com/questions/214009/btrfs-distribute-files-equally-across-multiple-devices
> 
> ---------------------------------------------------------------------------------
> 
> I have a btrfs volume across two devices that has metadata RAID 1 and
> data RAID 0. AFAIK, in the event one drive would fail, practically all
> files above the 64KB default stripe size would be corrupted. As this
> partition isn't performance critical, but should be space-efficient,
> I've thought about re-balancing the filesystem to distribute files
> equally across disks, but something like that doesn't seem to exist. The
> ultimate goal would be to be able to still read some of the files in the
> event of a drive failure.
> 
> AFAIK, using "single"/linear data allocation just fills up drives one by
> one (at least that's what the wiki says).

   Not quite. In single mode, the FS will allocate linear chunks of
space 1 GiB in size, and use those to write into (fitting many files
into each chunk, potentially). The chunks are allocated as needed, and
will go on the device with the most unallocated space.

   So, with equal-sized devices, the first 1 GiB will go on the first
device, the second 1 GiB on the second device, and so on.

   With unequal devices, you'll put data on the largest device, until
its free space reaches the size of the next largest, and then the
chunks will be alternated between those two, until the free space on
each of the two largest reaches the size of the third-largest, and so
on.

   (e.g. for devices sized 6 TB, 4 TB, 3 TB, the first 2 TB will go
exclusively on the first device; the next 2 TB will go on the first
two devices, alternating in 1 GiB chunks; the rest goes across all
three devices, again alternating in 1 GiB chunks.)
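The greedy most-unallocated-space rule described above can be checked with a small simulation. This is not the kernel's actual allocator; it assumes decimal units (1 TB = 1000 GB) and 1 GB chunks to keep the arithmetic simple:

```python
def allocate_chunks(sizes_tb, n_chunks, chunk_gb=1):
    # Each new chunk goes to the device with the most unallocated space,
    # as described above. Decimal units are a simplification; real btrfs
    # data chunks are 1 GiB.
    free = [s * 1000 for s in sizes_tb]      # unallocated space in GB
    counts = [0] * len(free)                 # chunks placed per device
    for _ in range(n_chunks):
        i = max(range(len(free)), key=lambda j: free[j])
        free[i] -= chunk_gb
        counts[i] += 1
    return counts

# 6 TB, 4 TB, 3 TB devices: the first 2 TB of chunks land on dev 0 alone;
# the next 2 TB alternate between dev 0 and dev 1.
first_2tb = allocate_chunks([6, 4, 3], 2000)    # [2000, 0, 0]
first_4tb = allocate_chunks([6, 4, 3], 4000)    # [3000, 1000, 0]
```

The simulation reproduces the worked example: allocation stays on the largest device until its free space matches the next largest, then alternates.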

   This is all very well for an append-only filesystem, but if you're
changing the files on the FS at all, there's no guarantee as to where
the changed extents will end up -- not even on the same device, let
alone close to the rest of the file on the platter.

   I did work out, some time ago, a prototype chunk allocator (the 1
GiB-scale allocations) that would allow enough flexibility to control
where the next chunk to be allocated would go. However, that still
leaves the extent allocator to deal with, which is the second, and
much harder, part of the problem.

   Basically, don't assume any kind of structure to the location of
your data on the devices you have, and keep good, tested, regular
backups of anything you can't stand to lose and can't replace. There
are no guarantees that would let you assume easily that any one file
is on a single device, or that anything would survive the loss of a
device.

   I'm sure this is an FAQ entry somewhere... It's come up enough
times.

   Hugo.

> The simplest implementation would probably be something like: Always
> write files to the disk with the least amount of space used. I think
> this may be a valid software-raid use-case, as it combines RAID 0 (w/o
> some of the performance gains[2]) with recoverability of about half of
> the data/files (balanced by filled space or amount of files) in the
> event of a drive-failure[3] by using filesystem information a
> hardware-raid doesn't have. In the end this is more or less JBOD with
> balanced disk usage + filesystem intelligence.
> 
> Is there something like that already in btrfs or could this be something
> the btrfs-devs would consider?
> 
> 
> [2] Still can read/write multiple files from/to different disks, so less
> performance only for "single-file-reads/writes"
> [3] using two disks, otherwise (totalDisks-failedDisks)/totalDisks

-- 
Hugo Mills             | "How deep will this sub go?"
hugo@... carfax.org.uk | "Oh, she'll go all the way to the bottom if we don't
http://carfax.org.uk/  | stop her."
PGP: E2AB1DE4          |                                                  U571



* Re: Btrfs - distribute files equally across multiple devices
  2015-07-06 17:53 ` Hugo Mills
@ 2015-07-06 18:34   ` Johannes Pfrang
  0 siblings, 0 replies; 5+ messages in thread
From: Johannes Pfrang @ 2015-07-06 18:34 UTC (permalink / raw)
  To: Hugo Mills, linux-btrfs


Thank you. That's a very helpful explanation. I've just run balance
start -dconvert=single ;)
Fwiw, the best explanation of "single" I could find was in the
Glossary[1].
I don't have an account on the wiki, but your first paragraph would fit
great there!


[1] https://btrfs.wiki.kernel.org/index.php/Glossary


On 06.07.2015 19:53, Hugo Mills wrote:
> On Mon, Jul 06, 2015 at 06:22:52PM +0200, Johannes Pfrang wrote:
>    Not quite. In single mode, the FS will allocate linear chunks of
> space 1 GiB in size, and use those to write into (fitting many files
> into each chunk, potentially). The chunks are allocated as needed, and
> will go on the device with the most unallocated space.
>
>    So, with equal-sized devices, the first 1 GiB will go on the first
> device, the second 1 GiB on the second device, and so on.
>
>    With unequal devices, you'll put data on the largest device, until
> its free space reaches the size of the next largest, and then the
> chunks will be alternated between those two, until the free space on
> each of the two largest reaches the size of the third-largest, and so
> on.
>
>    (e.g. for devices sized 6 TB, 4 TB, 3 TB, the first 2 TB will go
> exclusively on the first device; the next 2 TB will go on the first
> two devices, alternating in 1 GiB chunks; the rest goes across all
> three devices, again alternating in 1 GiB chunks.)
>
>    This is all very well for an append-only filesystem, but if you're
> changing the files on the FS at all, there's no guarantee as to where
> the changed extents will end up -- not even on the same device, let
> alone close to the rest of the file on the platter.
>
>    I did work out, some time ago, a prototype chunk allocator (the 1
> GiB-scale allocations) that would allow enough flexibility to control
> where the next chunk to be allocated would go. However, that still
> leaves the extent allocator to deal with, which is the second, and
> much harder, part of the problem.
>
>    Basically, don't assume any kind of structure to the location of
> your data on the devices you have, and keep good, tested, regular
> backups of anything you can't stand to lose and can't replace. There
> are no guarantees that would let you assume easily that any one file
> is on a single device, or that anything would survive the loss of a
> device.
I promise I won't assume that.

Two 4TB data disks:
- 3TiB+3TiB data=single,meta=raid1 replaceable/unimportant
- 654GiB|654GiB data/meta=raid1 important with regular backups

efficient + safe enough (for my use-case)
>
>    I'm sure this is an FAQ entry somewhere... It's come up enough
> times.
>
>    Hugo.
>
>



