From: Timofey Titovets <nefelim4ag@gmail.com>
To: "Janos Toth F." <toth.f.janos@gmail.com>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: RFC: raid with a variable stripe size
Date: Fri, 18 Nov 2016 23:51:33 +0300
Message-ID: <CAGqmi75BR8=dXGUcMHnj15H5i8dEH=d-y2vMHoAnGwMksHfj-Q@mail.gmail.com>
In-Reply-To: <CANznX5EvKzySaFrRFOULTkUqGGMRhKAQZEphMD9Kww_iOkN25A@mail.gmail.com>

2016-11-18 23:32 GMT+03:00 Janos Toth F. <toth.f.janos@gmail.com>:
> Based on the comments of this patch, stripe size could theoretically
> go as low as 512 byte:
> https://mail-archive.com/linux-btrfs@vger.kernel.org/msg56011.html
> If these very small (0.5k-2k) stripe sizes could really work (it's
> possible to implement such changes and it does not degrade performance
> too much - or at all - to keep it so low), we could use RAID-5(/6) on
> <=9(/10) disks with 512 byte physical sectors (assuming 4k filesystem
> sector size + 4k node size, although I am not sure if node size is
> really important here) without having to worry about RMW, extra space
> waste or additional fragmentation.
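
The disk counts quoted above follow from simple arithmetic: one 4k filesystem sector fills a whole stripe exactly when the data disks times the stripe element size equals 4k. A minimal sketch of that calculation, assuming a 4096-byte filesystem sector (the function name `max_disks` is only illustrative and is not part of btrfs):

```python
def max_disks(fs_sector: int, stripe_element: int, parity_disks: int) -> int:
    """Upper bound on disks so that one filesystem sector fills a whole stripe."""
    if fs_sector % stripe_element:
        raise ValueError("stripe element must divide the filesystem sector size")
    data_disks = fs_sector // stripe_element  # data elements needed for one sector
    return data_disks + parity_disks


for element in (512, 1024, 2048):
    raid5 = max_disks(4096, element, parity_disks=1)
    raid6 = max_disks(4096, element, parity_disks=2)
    print(f"{element:>4} B element: RAID-5 up to {raid5} disks, RAID-6 up to {raid6} disks")
```

With a 512-byte element this gives 8 data disks, hence 9 disks for RAID-5 and 10 for RAID-6.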
>
> On Fri, Nov 18, 2016 at 7:15 PM, Goffredo Baroncelli <kreijack@libero.it> wrote:
>> Hello,
>>
>> these are only my thoughts; no code here, but I would like to share them, hoping that they could be useful.
>>
>> As reported several times by Zygo (and others), one of the problems of raid5/6 is the write hole. Today BTRFS is not capable of addressing it.
>>
>> The problem is that the stripe size is bigger than the "sector size" (ok sector is not the correct word, but I am referring to the basic unit of writing on the disk, which is 4k or 16K in btrfs).
>> So when btrfs writes less data than the stripe, the stripe is not filled; when it is filled by a subsequent write, a RMW of the parity is required.
>>
>> To the best of my understanding (which could be very wrong), ZFS tries to solve this issue using a variable-length stripe.
>>
>> On BTRFS this could be achieved using several BGs (== block group or chunk), one for each stripe size.
>>
>> For example, if a RAID5 filesystem is composed of 4 disks, the filesystem should have three BGs:
>> BG #1, composed of two disks (1 data + 1 parity)
>> BG #2, composed of three disks (2 data + 1 parity)
>> BG #3, composed of four disks (3 data + 1 parity).
>>
>> If the data to be written has a size of 4k, it will be allocated to the BG #1.
>> If the data to be written has a size of 8k, it will be allocated to the BG #2
>> If the data to be written has a size of 12k, it will be allocated to the BG #3
>> If the data to be written has a size greater than 12k, it will be allocated to BG #3 as long as it fills full stripes; then the remainder will be stored in BG #1 or BG #2.
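
The allocation rule quoted above can be sketched as follows. This is only an illustration of the proposal, not btrfs allocator code: it assumes a 4k write unit, that BG #n has n data disks plus one parity disk, and that sizes are rounded up to whole 4k units. The names `SECTOR` and `allocate` are hypothetical.

```python
SECTOR = 4096  # assumed basic write unit (4k), as in the quoted example


def allocate(size: int, max_data_disks: int = 3) -> list[tuple[int, int]]:
    """Split a write into (bg_number, bytes) pieces; BG #n holds n data units per stripe."""
    sectors = -(-size // SECTOR)  # round up to whole 4k units
    full, rest = divmod(sectors, max_data_disks)
    # Full stripes go to the widest BG; the remainder goes to the BG of matching width.
    pieces = [(max_data_disks, max_data_disks * SECTOR)] * full
    if rest:
        pieces.append((rest, rest * SECTOR))
    return pieces
```

For example, a 20k write becomes one full 12k stripe in BG #3 plus an 8k stripe in BG #2, so no stripe is left partially filled.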
>>
>>
>> To avoid unbalancing the disk usage, each BG could use all the disks, even if a stripe uses fewer disks, e.g.:
>>
>> DISK1 DISK2 DISK3 DISK4
>> S1    S1    S1    S2
>> S2    S2    S3    S3
>> S3    S4    S4    S4
>> [....]
>>
>> The layout above shows a BG which uses all four disks, but has a stripe which spans only 3 disks.
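
The quoted table can be reproduced by placing stripe units round-robin over all disks. A small sketch under that assumption (the function name `layout` is hypothetical; the proposal does not specify an exact placement algorithm):

```python
def layout(num_disks: int, stripe_width: int, num_stripes: int) -> list[list[str]]:
    """Place stripe units round-robin over all disks; each row holds one unit per disk."""
    units = [f"S{s + 1}" for s in range(num_stripes) for _ in range(stripe_width)]
    return [units[i:i + num_disks] for i in range(0, len(units), num_disks)]


for row in layout(num_disks=4, stripe_width=3, num_stripes=4):
    print("  ".join(f"{label:<4}" for label in row))
```

Because the units of one stripe are consecutive and the stripe is no wider than the disk count, no stripe ever places two of its units on the same disk.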
>>
>>
>> Pro:
>> - btrfs is already capable of handling different BGs in the filesystem; only the allocator has to change
>> - no more RMW cycles are required (== higher performance)
>>
>> Cons:
>> - the data will be more fragmented
>> - the filesystem will have more BGs; this will require a re-balance from time to time. But this is an issue which we already know about (even if it may not be 100% addressed).
>>
>>
>> Thoughts ?
>>
>> BR
>> G.Baroncelli
>>
>>
>>
>> --
>> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

AFAIK all drives now use a 4k physical sector size and expose 512b sectors only logically.
So this would create another RMW cycle (read 4k -> modify 512b -> write 4k) instead
of just writing 512b.
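
To make the cost concrete, here is a simplified model of what a drive with 4k physical sectors has to do for one logical write. It ignores drive caches and write coalescing, and assumes a non-empty write; the function name `device_io` is hypothetical.

```python
def device_io(offset: int, length: int, physical: int = 4096) -> tuple[int, int]:
    """Return (bytes_read, bytes_written) the drive performs for one logical write."""
    first = offset // physical
    last = (offset + length - 1) // physical
    # Physical sectors that are only partly overwritten must be read before rewriting.
    partial = set()
    if offset % physical:
        partial.add(first)
    if (offset + length) % physical:
        partial.add(last)
    read = len(partial) * physical
    written = (last - first + 1) * physical
    return read, written
```

In this model a 512b write at offset 0 costs a 4k read plus a 4k write, while an aligned 4k write needs no read at all.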

-- 
Have a nice day,
Timofey.
