* Re: [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD
@ 2020-04-27 15:06 Torstein Eide
  2020-04-28 19:31 ` Goffredo Baroncelli
  0 siblings, 1 reply; 17+ messages in thread
From: Torstein Eide @ 2020-04-27 15:06 UTC (permalink / raw)
  To: kreijack; +Cc: hugo, linux-btrfs, martin.svec, mclaud, wangyugui

How will this affect disk sleep? Will it reduce the number of wake-up
calls to the HDD?

-- 
Torstein Eide
Torsteine@gmail.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD
  2020-04-27 15:06 [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD Torstein Eide
@ 2020-04-28 19:31 ` Goffredo Baroncelli
  0 siblings, 0 replies; 17+ messages in thread
From: Goffredo Baroncelli @ 2020-04-28 19:31 UTC (permalink / raw)
  To: Torstein Eide; +Cc: hugo, linux-btrfs, martin.svec, mclaud, wangyugui

On 4/27/20 5:06 PM, Torstein Eide wrote:
> How will this affect disk sleep? Will it reduce the number of wake-up
> calls to the HDD?
No; this patch puts the metadata on the SSD, leaving the data on the HDD;
this means that if you add data to a file, both the HDD and the SSD will be used.
  
BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD
  2020-05-30  6:48   ` Goffredo Baroncelli
@ 2020-05-30  8:57     ` Paul Jones
  0 siblings, 0 replies; 17+ messages in thread
From: Paul Jones @ 2020-05-30  8:57 UTC (permalink / raw)
  To: kreijack, Qu Wenruo, linux-btrfs
  Cc: Michael, Hugo Mills, Martin Svec, Wang Yugui

> -----Original Message-----
> From: linux-btrfs-owner@vger.kernel.org <linux-btrfs-
> owner@vger.kernel.org> On Behalf Of Goffredo Baroncelli
> Sent: Saturday, 30 May 2020 4:48 PM
> To: Qu Wenruo <quwenruo.btrfs@gmx.com>; linux-btrfs@vger.kernel.org
> Cc: Michael <mclaud@roznica.com.ua>; Hugo Mills <hugo@carfax.org.uk>;
> Martin Svec <martin.svec@zoner.cz>; Wang Yugui <wangyugui@e16-
> tech.com>
> Subject: Re: [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD
> 
> On 5/30/20 6:59 AM, Qu Wenruo wrote:
> [...]
> >> This new mode is enabled passing the option ssd_metadata at mount
> time.
> >> This policy of allocation is the "preferred" one. If this doesn't
> >> permit a chunk allocation, the "classic" one is used.
> >
> > One thing to improve here: in fact we can use existing members to
> > store the device-related info:
> > - btrfs_dev_item::seek_speed
> > - btrfs_dev_item::bandwidth (I tend to rename it to IOPS)
> 
> Hi Qu,
> 
> this patch was an older version; the current one (sent 2 days ago) stores the
> setting of which disks have to be considered as "preferred_metadata".
> >
> > In fact, what you're trying to do is to provide a policy to allocate
> > chunks based on each device performance characteristics.
> >
> > I believe it would be super awesome, but to get it upstream, I guess
> > we would prefer a more flexible framework, thus it would be pretty slow to
> merge.
> 
> I agree. And considering that in the near future SSDs will become more
> widespread, I don't know if the effort (and the time required) is worth it.

I think it will be. Consider a large 10TB+ filesystem that runs on cheap unbuffered SSDs - metadata will still be a bottleneck like it is now, just everything happens much faster. Archival storage will likely be rotational-based for a long time yet for cost reasons, and this is where ssd metadata shines. I've been running your ssd_metadata patch for over a month now and it's flipping fantastic! The responsiveness it brings to networked archival storage is amazing.

Paul.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD
  2020-05-30  4:59 ` Qu Wenruo
@ 2020-05-30  6:48   ` Goffredo Baroncelli
  2020-05-30  8:57     ` Paul Jones
  0 siblings, 1 reply; 17+ messages in thread
From: Goffredo Baroncelli @ 2020-05-30  6:48 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: Michael, Hugo Mills, Martin Svec, Wang Yugui

On 5/30/20 6:59 AM, Qu Wenruo wrote:
[...]
>> This new mode is enabled passing the option ssd_metadata at mount time.
>> This policy of allocation is the "preferred" one. If this doesn't permit
>> a chunk allocation, the "classic" one is used.
> 
> One thing to improve here: in fact we can use existing members to
> store the device-related info:
> - btrfs_dev_item::seek_speed
> - btrfs_dev_item::bandwidth (I tend to rename it to IOPS)

Hi Qu,

this patch was an older version; the current one (sent 2 days ago) stores the setting
of which disks have to be considered as "preferred_metadata".
> 
> In fact, what you're trying to do is to provide a policy to allocate
> chunks based on each device performance characteristics.
> 
> I believe it would be super awesome, but to get it upstream, I guess we
> would prefer a more flexible framework, thus it would be pretty slow to merge.

I agree. And considering that in the near future SSDs will become more
widespread, I don't know if the effort (and the time required) is worth it.

> 
> But still, thanks for your awesome idea.
> 
> Thanks,
> Qu
> 
> 
>>
>> Some examples: (/dev/sd[abc] are ssd, and /dev/sd[ef] are rotational)
>>
>> Non striped profile: metadata->raid1, data->raid1
>> The data is stored on /dev/sd[ef], metadata is stored on /dev/sd[abc].
>> When /dev/sd[ef] are full, then the data chunk is allocated also on
>> /dev/sd[abc].
>>
>> Striped profile: metadata->raid6, data->raid6
>> raid6 requires 3 disks at minimum, so /dev/sd[ef] are not enough for a
>> data profile raid6. To allow a data chunk allocation, the data profile raid6
>> will be stored on all the disks /dev/sd[abcdef].
>> Instead the metadata profile raid6 will be allocated on /dev/sd[abc],
>> because these are enough to host this chunk.
>>
>> Changelog:
>> v1: - first issue
>> v2: - rebased to v5.6.2
>>      - correct the comparison about the rotational disks (>= instead of >)
>>      - add the flag rotational to the struct btrfs_device_info to
>>        simplify the comparison function (btrfs_cmp_device_info*() )
>> v3: - correct the collision between BTRFS_MOUNT_DISCARD_ASYNC and
>>        BTRFS_MOUNT_SSD_METADATA.
>>
>> Below I collected some data to highlight the performance increment.
>>
>> Test setup:
>> As a test I performed a "dist-upgrade" of a Debian from stretch to buster.
>> The test consisted of an image of a Debian stretch[1] with the packages
>> needed under /var/cache/apt/archives/ (so no networking was involved).
>> For each test I formatted the filesystem from scratch, un-tarred the
>> image and then ran "apt-get dist-upgrade" [2]. For each disk(s)/filesystem
>> combination I measured the time of apt dist-upgrade with and
>> without the flag "force-unsafe-io", which reduces the use of sync(2) and
>> flush(2). The ssd was 20GB, the hdd was 230GB.
>>
>> I considered the following scenarios:
>> - btrfs over ssd
>> - btrfs over ssd + hdd with my patch enabled
>> - btrfs over bcache over hdd+ssd
>> - btrfs over hdd (very, very slow....)
>> - ext4 over ssd
>> - ext4 over hdd
>>
>> The test machine was an "AMD A6-6400K" with 4GB of ram, where 3GB was used
>> as cache/buff.
>>
>> Data analysis:
>>
>> Of course btrfs is slower than ext4 when a lot of sync/flush calls are involved.
>> Using apt on a rotational disk was a dramatic experience. IMHO this should be
>> replaced by using the btrfs snapshot capabilities. But this is another (not easy) story.
>>
>> Unsurprisingly, bcache performs better than my patch. But this is an expected
>> result, because it can also cache the data chunks (reads can go directly to
>> the ssd). bcache is about +60% slower when there are a lot of sync/flush calls
>> and only +20% slower in the other case.
>>
>> Regarding the test with force-unsafe-io (fewer sync/flush calls), my patch reduces
>> the overhead from the +256% of hdd-only to +113%, which I consider a good
>> result considering how small the patch is.
>>
>>
>> Raw data:
>> The data below is the "real" time (as returned by the time command) consumed by
>> apt
>>
>>
>> Test description         real (mmm:ss)	Delta %
>> --------------------     -------------  -------
>> btrfs hdd w/sync	   142:38	+533%
>> btrfs ssd+hdd w/sync        81:04	+260%
>> ext4 hdd w/sync	            52:39	+134%
>> btrfs bcache w/sync	    35:59	 +60%
>> btrfs ssd w/sync	    22:31	reference
>> ext4 ssd w/sync	            12:19	 -45%
>>
>>
>>
>> Test description         real (mmm:ss)	Delta %
>> --------------------     -------------  -------
>> btrfs hdd	             56:2	+256%
>> ext4 hdd	            51:32	+228%
>> btrfs ssd+hdd	            33:30	+113%
>> btrfs bcache	            18:57	 +20%
>> btrfs ssd	            15:44	reference
>> ext4 ssd	            11:49	 -25%
>>
>>
>> [1] I created the image using "debootstrap stretch", then I installed a set
>> of packages using the commands:
>>
>>    # debootstrap stretch test/
>>    # chroot test/
>>    # mount -t proc proc proc
>>    # mount -t sysfs sys sys
>>    # apt --option=Dpkg::Options::=--force-confold \
>>          --option=Dpkg::options::=--force-unsafe-io \
>> 	install mate-desktop-environment* xserver-xorg vim \
>>          task-kde-desktop task-gnome-desktop
>>
>> Then I updated the release from stretch to buster by changing the file /etc/apt/sources.list.
>> Then I downloaded the packages for the dist-upgrade:
>>
>>    # apt-get update
>>    # apt-get --download-only dist-upgrade
>>
>> Then I created a tar of this image.
>> Before the dist-upgrade the space used was about 7GB with 2281
>> packages. After the dist-upgrade, the space used was 9GB with 2870 packages.
>> The upgrade installed/updated about 2251 packages.
>>
>>
>> [2] The command was a bit more complex, to avoid an interactive session
>>
>>    # mkfs.btrfs -m single -d single /dev/sdX
>>    # mount /dev/sdX test/
>>    # cd test
>>    # time tar xzf ../image.tgz
>>    # chroot .
>>    # mount -t proc proc proc
>>    # mount -t sysfs sys sys
>>    # export DEBIAN_FRONTEND=noninteractive
>>    # time apt-get -y --option=Dpkg::Options::=--force-confold \
>> 	--option=Dpkg::options::=--force-unsafe-io dist-upgrade
>>
>>
>> BR
>> G.Baroncelli
>>


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD
  2020-04-05  8:26 Goffredo Baroncelli
  2020-04-05 10:57 ` Graham Cobb
  2020-05-29 16:06 ` Hans van Kranenburg
@ 2020-05-30  4:59 ` Qu Wenruo
  2020-05-30  6:48   ` Goffredo Baroncelli
  2 siblings, 1 reply; 17+ messages in thread
From: Qu Wenruo @ 2020-05-30  4:59 UTC (permalink / raw)
  To: Goffredo Baroncelli, linux-btrfs
  Cc: Michael, Hugo Mills, Martin Svec, Wang Yugui



On 2020/4/5 4:26 PM, Goffredo Baroncelli wrote:
>
> Hi all,
>
> This is an RFC; I wrote this patch because I find the idea interesting
> even though it adds more complication to the chunk allocator.
>
> The core idea is to store the metadata on the ssd and to leave the data
> on the rotational disks. BTRFS looks at the rotational flags to
> understand the kind of disks.
>
> This new mode is enabled passing the option ssd_metadata at mount time.
> This policy of allocation is the "preferred" one. If this doesn't permit
> a chunk allocation, the "classic" one is used.

One thing to improve here: in fact we can use existing members to
store the device-related info:
- btrfs_dev_item::seek_speed
- btrfs_dev_item::bandwidth (I tend to rename it to IOPS)
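
(These two fields have been in the on-disk format for a long time but are
written as zero and not consulted today; a reasonably recent btrfs-progs
should be able to show them in a chunk tree dump, e.g.:

  btrfs inspect-internal dump-tree -t chunk /dev/sdX | grep -E 'seek_speed|bandwidth'

where /dev/sdX is just a placeholder for one of the filesystem's devices.)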

In fact, what you're trying to do is to provide a policy to allocate
chunks based on each device performance characteristics.

I believe it would be super awesome, but to get it upstream, I guess we
would prefer a more flexible framework, thus it would be pretty slow to merge.

But still, thanks for your awesome idea.

Thanks,
Qu


>
> Some examples: (/dev/sd[abc] are ssd, and /dev/sd[ef] are rotational)
>
> Non striped profile: metadata->raid1, data->raid1
> The data is stored on /dev/sd[ef], metadata is stored on /dev/sd[abc].
> When /dev/sd[ef] are full, then the data chunk is allocated also on
> /dev/sd[abc].
>
> Striped profile: metadata->raid6, data->raid6
> raid6 requires 3 disks at minimum, so /dev/sd[ef] are not enough for a
> data profile raid6. To allow a data chunk allocation, the data profile raid6
> will be stored on all the disks /dev/sd[abcdef].
> Instead the metadata profile raid6 will be allocated on /dev/sd[abc],
> because these are enough to host this chunk.
>
> Changelog:
> v1: - first issue
> v2: - rebased to v5.6.2
>     - correct the comparison about the rotational disks (>= instead of >)
>     - add the flag rotational to the struct btrfs_device_info to
>       simplify the comparison function (btrfs_cmp_device_info*() )
> v3: - correct the collision between BTRFS_MOUNT_DISCARD_ASYNC and
>       BTRFS_MOUNT_SSD_METADATA.
>
> Below I collected some data to highlight the performance increment.
>
> Test setup:
> As a test I performed a "dist-upgrade" of a Debian from stretch to buster.
> The test consisted of an image of a Debian stretch[1] with the packages
> needed under /var/cache/apt/archives/ (so no networking was involved).
> For each test I formatted the filesystem from scratch, un-tarred the
> image and then ran "apt-get dist-upgrade" [2]. For each disk(s)/filesystem
> combination I measured the time of apt dist-upgrade with and
> without the flag "force-unsafe-io", which reduces the use of sync(2) and
> flush(2). The ssd was 20GB, the hdd was 230GB.
>
> I considered the following scenarios:
> - btrfs over ssd
> - btrfs over ssd + hdd with my patch enabled
> - btrfs over bcache over hdd+ssd
> - btrfs over hdd (very, very slow....)
> - ext4 over ssd
> - ext4 over hdd
>
> The test machine was an "AMD A6-6400K" with 4GB of ram, where 3GB was used
> as cache/buff.
>
> Data analysis:
>
> Of course btrfs is slower than ext4 when a lot of sync/flush calls are involved.
> Using apt on a rotational disk was a dramatic experience. IMHO this should be
> replaced by using the btrfs snapshot capabilities. But this is another (not easy) story.
>
> Unsurprisingly, bcache performs better than my patch. But this is an expected
> result, because it can also cache the data chunks (reads can go directly to
> the ssd). bcache is about +60% slower when there are a lot of sync/flush calls
> and only +20% slower in the other case.
>
> Regarding the test with force-unsafe-io (fewer sync/flush calls), my patch reduces
> the overhead from the +256% of hdd-only to +113%, which I consider a good
> result considering how small the patch is.
>
>
> Raw data:
> The data below is the "real" time (as returned by the time command) consumed by
> apt
>
>
> Test description         real (mmm:ss)	Delta %
> --------------------     -------------  -------
> btrfs hdd w/sync	   142:38	+533%
> btrfs ssd+hdd w/sync        81:04	+260%
> ext4 hdd w/sync	            52:39	+134%
> btrfs bcache w/sync	    35:59	 +60%
> btrfs ssd w/sync	    22:31	reference
> ext4 ssd w/sync	            12:19	 -45%
>
>
>
> Test description         real (mmm:ss)	Delta %
> --------------------     -------------  -------
> btrfs hdd	             56:2	+256%
> ext4 hdd	            51:32	+228%
> btrfs ssd+hdd	            33:30	+113%
> btrfs bcache	            18:57	 +20%
> btrfs ssd	            15:44	reference
> ext4 ssd	            11:49	 -25%
>
>
> [1] I created the image using "debootstrap stretch", then I installed a set
> of packages using the commands:
>
>   # debootstrap stretch test/
>   # chroot test/
>   # mount -t proc proc proc
>   # mount -t sysfs sys sys
>   # apt --option=Dpkg::Options::=--force-confold \
>         --option=Dpkg::options::=--force-unsafe-io \
> 	install mate-desktop-environment* xserver-xorg vim \
>         task-kde-desktop task-gnome-desktop
>
> Then I updated the release from stretch to buster by changing the file /etc/apt/sources.list.
> Then I downloaded the packages for the dist-upgrade:
>
>   # apt-get update
>   # apt-get --download-only dist-upgrade
>
> Then I created a tar of this image.
> Before the dist-upgrade the space used was about 7GB with 2281
> packages. After the dist-upgrade, the space used was 9GB with 2870 packages.
> The upgrade installed/updated about 2251 packages.
>
>
> [2] The command was a bit more complex, to avoid an interactive session
>
>   # mkfs.btrfs -m single -d single /dev/sdX
>   # mount /dev/sdX test/
>   # cd test
>   # time tar xzf ../image.tgz
>   # chroot .
>   # mount -t proc proc proc
>   # mount -t sysfs sys sys
>   # export DEBIAN_FRONTEND=noninteractive
>   # time apt-get -y --option=Dpkg::Options::=--force-confold \
> 	--option=Dpkg::options::=--force-unsafe-io dist-upgrade
>
>
> BR
> G.Baroncelli
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD
  2020-05-29 16:40   ` Goffredo Baroncelli
@ 2020-05-29 18:37     ` Hans van Kranenburg
  0 siblings, 0 replies; 17+ messages in thread
From: Hans van Kranenburg @ 2020-05-29 18:37 UTC (permalink / raw)
  To: kreijack, linux-btrfs; +Cc: Michael, Hugo Mills, Martin Svec, Wang Yugui

On 5/29/20 6:40 PM, Goffredo Baroncelli wrote:
> On 5/29/20 6:06 PM, Hans van Kranenburg wrote:
>> Hi Goffredo,
>>
>> On 4/5/20 10:26 AM, Goffredo Baroncelli wrote:
>>>
>>> This is an RFC; I wrote this patch because I find the idea interesting
>>> even though it adds more complication to the chunk allocator.
>>>
>>> The core idea is to store the metadata on the ssd and to leave the data
>>> on the rotational disks. BTRFS looks at the rotational flags to
>>> understand the kind of disks.
>>
>> Like I said yesterday, thanks for working on these kind of proof of
>> concepts. :)
>>
>> Even while this can't be a final solution, it's still very useful in the
>> meantime for users for which this is sufficient right now.
>>
>> I simply did not realize before that it was possible to just set that
>> rotational flag myself using an udev rule... How convenient.
>>
>> -# cat /etc/udev/rules.d/99-yolo.rules
>> ACTION=="add|change",
>> ENV{ID_FS_UUID_SUB_ENC}=="4139fb4c-e7c4-49c7-a4ce-5c86f683ffdc",
>> ATTR{queue/rotational}="1"
>> ACTION=="add|change",
>> ENV{ID_FS_UUID_SUB_ENC}=="192139f4-1618-4089-95fd-4a863db9416b",
>> ATTR{queue/rotational}="0"
> 
> Yes, but of course this should be an exception rather than the default

For non-local storage, the default is that this rotational value is
completely bogus.

What I mean is that I like that this PoC patch (ab)uses existing stuff,
and does not rely on changing the filesystem (yet) in any way, so it can
be thrown out at any time later without consequences.

>>> This new mode is enabled passing the option ssd_metadata at mount time.
>>> This policy of allocation is the "preferred" one. If this doesn't permit
>>> a chunk allocation, the "classic" one is used.
>>>
>>> Some examples: (/dev/sd[abc] are ssd, and /dev/sd[ef] are rotational)
>>>
>>> Non striped profile: metadata->raid1, data->raid1
>>> The data is stored on /dev/sd[ef], metadata is stored on /dev/sd[abc].
>>> When /dev/sd[ef] are full, then the data chunk is allocated also on
>>> /dev/sd[abc].
>>>
>>> Striped profile: metadata->raid6, data->raid6
>>> raid6 requires 3 disks at minimum, so /dev/sd[ef] are not enough for a
>>> data profile raid6. To allow a data chunk allocation, the data profile raid6
>>> will be stored on all the disks /dev/sd[abcdef].
>>> Instead the metadata profile raid6 will be allocated on /dev/sd[abc],
>>> because these are enough to host this chunk.
>>
>> Yes, and while the explanation above focuses on multi-disk profiles, it
>> might be useful (for the similar section in later versions) to
>> explicitly mention that for single profile, the same algorithm will just
>> cause it to overflow to a less preferred disk if the preferred one is
>> completely full. Neat!
>>
>> I've been testing this change on top of my 4.19 kernel, and also tried
>> to come up with some edge cases, doing ridiculous things to generate
>> metadata usage and do stuff like btrfs fi resize to push metadata away
>> from the preferred device etc... No weird things happened.
>>
>> I guess there will be no further work on this V3, the only comment I
>> would have now is that an Opt_no_ssd_metadata would be nice for testing,
>> but I can hack that in myself.
> 
> Because ssd_metadata is not a default, what would be the purpose of
> Opt_no_ssd_metadata?

While testing, it lets me mount -o remount,no_ssd_metadata without having to
umount / mount and without stopping the data generating/removing test processes,
so that data gets written to the "wrong" disks again.

Hans


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD
  2020-05-29 16:06 ` Hans van Kranenburg
@ 2020-05-29 16:40   ` Goffredo Baroncelli
  2020-05-29 18:37     ` Hans van Kranenburg
  0 siblings, 1 reply; 17+ messages in thread
From: Goffredo Baroncelli @ 2020-05-29 16:40 UTC (permalink / raw)
  To: Hans van Kranenburg, linux-btrfs
  Cc: Michael, Hugo Mills, Martin Svec, Wang Yugui

On 5/29/20 6:06 PM, Hans van Kranenburg wrote:
> Hi Goffredo,
> 
> On 4/5/20 10:26 AM, Goffredo Baroncelli wrote:
>>
>> This is an RFC; I wrote this patch because I find the idea interesting
>> even though it adds more complication to the chunk allocator.
>>
>> The core idea is to store the metadata on the ssd and to leave the data
>> on the rotational disks. BTRFS looks at the rotational flags to
>> understand the kind of disks.
> 
> Like I said yesterday, thanks for working on these kind of proof of
> concepts. :)
> 
> Even while this can't be a final solution, it's still very useful in the
> meantime for users for which this is sufficient right now.
> 
> I simply did not realize before that it was possible to just set that
> rotational flag myself using an udev rule... How convenient.
> 
> -# cat /etc/udev/rules.d/99-yolo.rules
> ACTION=="add|change",
> ENV{ID_FS_UUID_SUB_ENC}=="4139fb4c-e7c4-49c7-a4ce-5c86f683ffdc",
> ATTR{queue/rotational}="1"
> ACTION=="add|change",
> ENV{ID_FS_UUID_SUB_ENC}=="192139f4-1618-4089-95fd-4a863db9416b",
> ATTR{queue/rotational}="0"

Yes, but of course this should be an exception rather than the default

> 
>> This new mode is enabled passing the option ssd_metadata at mount time.
>> This policy of allocation is the "preferred" one. If this doesn't permit
>> a chunk allocation, the "classic" one is used.
>>
>> Some examples: (/dev/sd[abc] are ssd, and /dev/sd[ef] are rotational)
>>
>> Non striped profile: metadata->raid1, data->raid1
>> The data is stored on /dev/sd[ef], metadata is stored on /dev/sd[abc].
>> When /dev/sd[ef] are full, then the data chunk is allocated also on
>> /dev/sd[abc].
>>
>> Striped profile: metadata->raid6, data->raid6
>> raid6 requires 3 disks at minimum, so /dev/sd[ef] are not enough for a
>> data profile raid6. To allow a data chunk allocation, the data profile raid6
>> will be stored on all the disks /dev/sd[abcdef].
>> Instead the metadata profile raid6 will be allocated on /dev/sd[abc],
>> because these are enough to host this chunk.
> 
> Yes, and while the explanation above focuses on multi-disk profiles, it
> might be useful (for the similar section in later versions) to
> explicitly mention that for single profile, the same algorithm will just
> cause it to overflow to a less preferred disk if the preferred one is
> completely full. Neat!
> 
> I've been testing this change on top of my 4.19 kernel, and also tried
> to come up with some edge cases, doing ridiculous things to generate
> metadata usage and do stuff like btrfs fi resize to push metadata away
> from the preferred device etc... No weird things happened.
> 
> I guess there will be no further work on this V3, the only comment I
> would have now is that an Opt_no_ssd_metadata would be nice for testing,
> but I can hack that in myself.

Because ssd_metadata is not a default, what would be the purpose of
Opt_no_ssd_metadata?

> 
> Thanks,
> Hans
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD
  2020-04-05  8:26 Goffredo Baroncelli
  2020-04-05 10:57 ` Graham Cobb
@ 2020-05-29 16:06 ` Hans van Kranenburg
  2020-05-29 16:40   ` Goffredo Baroncelli
  2020-05-30  4:59 ` Qu Wenruo
  2 siblings, 1 reply; 17+ messages in thread
From: Hans van Kranenburg @ 2020-05-29 16:06 UTC (permalink / raw)
  To: Goffredo Baroncelli, linux-btrfs
  Cc: Michael, Hugo Mills, Martin Svec, Wang Yugui

Hi Goffredo,

On 4/5/20 10:26 AM, Goffredo Baroncelli wrote:
> 
> This is an RFC; I wrote this patch because I find the idea interesting
> even though it adds more complication to the chunk allocator.
> 
> The core idea is to store the metadata on the ssd and to leave the data
> on the rotational disks. BTRFS looks at the rotational flags to
> understand the kind of disks.

Like I said yesterday, thanks for working on these kind of proof of
concepts. :)

Even while this can't be a final solution, it's still very useful in the
meantime for users for which this is sufficient right now.

I simply did not realize before that it was possible to just set that
rotational flag myself using an udev rule... How convenient.

-# cat /etc/udev/rules.d/99-yolo.rules
ACTION=="add|change",
ENV{ID_FS_UUID_SUB_ENC}=="4139fb4c-e7c4-49c7-a4ce-5c86f683ffdc",
ATTR{queue/rotational}="1"
ACTION=="add|change",
ENV{ID_FS_UUID_SUB_ENC}=="192139f4-1618-4089-95fd-4a863db9416b",
ATTR{queue/rotational}="0"
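
A quick way to apply and verify such a rule (a sketch; sdX is only a
placeholder for the device the rule matches):

  udevadm control --reload-rules
  udevadm trigger --subsystem-match=block --action=change
  cat /sys/block/sdX/queue/rotational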

> This new mode is enabled passing the option ssd_metadata at mount time.
> This policy of allocation is the "preferred" one. If this doesn't permit
> a chunk allocation, the "classic" one is used.
> 
> Some examples: (/dev/sd[abc] are ssd, and /dev/sd[ef] are rotational)
> 
> Non striped profile: metadata->raid1, data->raid1
> The data is stored on /dev/sd[ef], metadata is stored on /dev/sd[abc].
> When /dev/sd[ef] are full, then the data chunk is allocated also on
> /dev/sd[abc].
> 
> Striped profile: metadata->raid6, data->raid6
> raid6 requires 3 disks at minimum, so /dev/sd[ef] are not enough for a
> data profile raid6. To allow a data chunk allocation, the data profile raid6
> will be stored on all the disks /dev/sd[abcdef].
> Instead the metadata profile raid6 will be allocated on /dev/sd[abc],
> because these are enough to host this chunk.

Yes, and while the explanation above focuses on multi-disk profiles, it
might be useful (for the similar section in later versions) to
explicitly mention that for single profile, the same algorithm will just
cause it to overflow to a less preferred disk if the preferred one is
completely full. Neat!

I've been testing this change on top of my 4.19 kernel, and also tried
to come up with some edge cases, doing ridiculous things to generate
metadata usage and do stuff like btrfs fi resize to push metadata away
from the preferred device etc... No weird things happened.

I guess there will be no further work on this V3, the only comment I
would have now is that an Opt_no_ssd_metadata would be nice for testing,
but I can hack that in myself.

Thanks,
Hans

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD
  2020-04-06 17:33         ` Goffredo Baroncelli
@ 2020-04-06 17:40           ` Zygo Blaxell
  0 siblings, 0 replies; 17+ messages in thread
From: Zygo Blaxell @ 2020-04-06 17:40 UTC (permalink / raw)
  To: Goffredo Baroncelli; +Cc: Graham Cobb, linux-btrfs

On Mon, Apr 06, 2020 at 07:33:16PM +0200, Goffredo Baroncelli wrote:
> On 4/6/20 7:21 PM, Zygo Blaxell wrote:
> > On Mon, Apr 06, 2020 at 06:43:04PM +0200, Goffredo Baroncelli wrote:
> > > On 4/6/20 4:24 AM, Zygo Blaxell wrote:
> > > > > > Of course btrfs is slower than ext4 when a lot of sync/flush are involved. Using
> > > > > > apt on a rotational was a dramatic experience. And IMHO  this should be replaced
> > > > > > by using the btrfs snapshot capabilities. But this is another (not easy) story.
> > > > flushoncommit and eatmydata work reasonably well...once you patch out the
> > > > noise warnings from fs-writeback.
> > > > 
> > > 
> > > You wrote flushoncommit, but did you mean "noflushoncommit" ?
> > 
> > No.  "noflushoncommit" means applications have to call fsync() all the
> > time, or their files get trashed on a crash.  I meant flushoncommit
> > and eatmydata.
> 
> Is it a tristate value (default, flushoncommit, noflushoncommit), or
> IS flushoncommit the default?

noflushoncommit is the default.  flushoncommit is sort of terrible--it
used to have deadlock bugs up to 4.15, and spams the kernel log with
warnings since 4.15.

> > While dpkg runs, it must never call fsync, or it breaks the write
> > ordering provided by flushoncommit (or you have to zero-log on boot).
> > btrfs effectively does a point-in-time snapshot at every commit interval.
> > dpkg's ordering of write operations and renames does the rest.
> > 
> > dpkg runs much faster, so the window for interruption is smaller, and
> > if it is interrupted, then the result is more or less the same as if
> > you had run with fsync() on noflushoncommit.  The difference is that
> > the filesystem might roll back to an earlier state after a crash, which
> > could be a problem e.g. if your maintainer scripts are manipulating data
> > on multiple filesystems.
> > 
> > 
> > > Regarding eatmydata, I used it too. However I was never happy. Below my script:
> > > ----------------------------------
> > > ghigo@venice:/etc/apt/apt.conf.d$ cat 10btrfs.conf
> > > 
> > > DPkg::Pre-Invoke {"bash /var/btrfs/btrfs-apt.sh snapshot";};
> > > DPkg::Post-Invoke {"bash /var/btrfs/btrfs-apt.sh clean";};
> > > Dpkg::options {"--force-unsafe-io";};
> > > ---------------------------------
> > > ghigo@venice:/etc/apt/apt.conf.d$ cat /var/btrfs/btrfs-apt.sh
> > > 
> > > btrfsroot=/var/btrfs/debian
> > > btrfsrollback=/var/btrfs/debian-rollback
> > > 
> > > 
> > > do_snapshot() {
> > > 	if [ -d "$btrfsrollback" ]; then
> > > 		btrfs subvolume delete "$btrfsrollback"
> > > 	fi
> > > 
> > > 	i=20
> > > 	while [ $i -gt 0 -a -d "$btrfsrollback" ]; do
> > > 		i=$(( $i - 1 ))
> > > 		sleep 0.1
> > > 	done
> > > 	if [ $i -eq 0 ]; then
> > > 		exit 100
> > > 	fi
> > > 
> > > 	btrfs subvolume snapshot "$btrfsroot" "$btrfsrollback"
> > > 	
> > > }
> > > 
> > > do_removerollback() {
> > > 	if [ -d "$btrfsrollback" ]; then
> > > 		btrfs subvolume delete "$btrfsrollback"
> > > 	fi
> > > }
> > > 
> > > if [ "$1" = "snapshot" ]; then
> > > 	do_snapshot
> > > elif [ "$1" = "clean" ]; then
> > > 	do_removerollback
> > > else
> > > 	echo "usage: $0  snapshot|clean"
> > > fi
> > > --------------------------------------------------------------
> > > 
> > > Suggestions are welcome on how to detect automatically where the
> > > btrfs root (subvolume=/) is mounted and my root subvolume name (debian in my
> > > case), so I can avoid hard-coding them in my script.
> > 
> > You can figure out where "/" is within a btrfs filesystem by recursively
> > looking up parent subvol IDs with TREE_SEARCH_V2 until you get to 5
> > FS_ROOT (sort of like the way pwd works on traditional Unix); however,
> > root can be a bind mount, so "path from fs_root to /" is not guaranteed
> > to end at a subvol root.
> 
> Maybe a use case for a new ioctl :-) ? Snapshotting a subvolume without
> mounting the root subvolume....

That would make access control mechanisms like chroot...challenging.
;)  But I hear we have a delete-by-id ioctl now, so might as well have
snap-by-id too.

> > Also, sometimes people put /var on its own subvol, so you'd need to
> > find "the set of all subvols relevant to dpkg" and that's definitely
> > not trivial in the general case.
> 
> I know that as a general rule it is not easy. Anyway, I would also put /boot
> and /home in a dedicated subvolume.
> If the "rollback" is done at boot, /boot should be an invariant...
> However I think that there are a lot of corner cases even here (what happens
> if the boot kernel doesn't have its modules in the root subvolume?)
> 
> It is not an easy job. It must be performed at distribution level...
> 
> > 
> > It's not as easy to figure out if there's an existing fs_root mount
> > point (partly because namespacing mangles every path in /proc/mounts
> > and mountinfo), but if you know the btrfs device (and can access it
> > from your namespace) you can just mount it somewhere and then you do
> > know where it is.
> 
> I agree: look up the "root device" starting from /, then mount the
> root subvolume in a known place, where it is possible to snapshot
> the root subvolume.
> 
> > 
> > > BR
> > > G.Baroncelli
> > > -- 
> > > gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> > > Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
> > > 
> 
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD
  2020-04-06 17:21       ` Zygo Blaxell
@ 2020-04-06 17:33         ` Goffredo Baroncelli
  2020-04-06 17:40           ` Zygo Blaxell
  0 siblings, 1 reply; 17+ messages in thread
From: Goffredo Baroncelli @ 2020-04-06 17:33 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Graham Cobb, linux-btrfs

On 4/6/20 7:21 PM, Zygo Blaxell wrote:
> On Mon, Apr 06, 2020 at 06:43:04PM +0200, Goffredo Baroncelli wrote:
>> On 4/6/20 4:24 AM, Zygo Blaxell wrote:
>>>>> Of course btrfs is slower than ext4 when a lot of sync/flush are involved. Using
>>>>> apt on a rotational was a dramatic experience. And IMHO  this should be replaced
>>>>> by using the btrfs snapshot capabilities. But this is another (not easy) story.
>>> flushoncommit and eatmydata work reasonably well...once you patch out the
>>> noise warnings from fs-writeback.
>>>
>>
>> You wrote flushoncommit, but did you mean "noflushoncommit" ?
> 
> No.  "noflushoncommit" means applications have to call fsync() all the
> time, or their files get trashed on a crash.  I meant flushoncommit
> and eatmydata.

Is it a tristate value (default, flushoncommit, noflushoncommit), or
IS flushoncommit the default?
> 
> While dpkg runs, it must never call fsync, or it breaks the write
> ordering provided by flushoncommit (or you have to zero-log on boot).
> btrfs effectively does a point-in-time snapshot at every commit interval.
> dpkg's ordering of write operations and renames does the rest.
> 
> dpkg runs much faster, so the window for interruption is smaller, and
> if it is interrupted, then the result is more or less the same as if
> you had run with fsync() on noflushoncommit.  The difference is that
> the filesystem might roll back to an earlier state after a crash, which
> could be a problem e.g. if your maintainer scripts are manipulating data
> on multiple filesystems.
> 
> 
>> Regarding eatmydata, I used it too. However I was never happy. Below my script:
>> ----------------------------------
>> ghigo@venice:/etc/apt/apt.conf.d$ cat 10btrfs.conf
>>
>> DPkg::Pre-Invoke {"bash /var/btrfs/btrfs-apt.sh snapshot";};
>> DPkg::Post-Invoke {"bash /var/btrfs/btrfs-apt.sh clean";};
>> Dpkg::options {"--force-unsafe-io";};
>> ---------------------------------
>> ghigo@venice:/etc/apt/apt.conf.d$ cat /var/btrfs/btrfs-apt.sh
>>
>> btrfsroot=/var/btrfs/debian
>> btrfsrollback=/var/btrfs/debian-rollback
>>
>>
>> do_snapshot() {
>> 	if [ -d "$btrfsrollback" ]; then
>> 		btrfs subvolume delete "$btrfsrollback"
>> 	fi
>>
>> 	i=20
>> 	while [ $i -gt 0 -a -d "$btrfsrollback" ]; do
>> 		i=$(( $i - 1 ))
>> 		sleep 0.1
>> 	done
>> 	if [ $i -eq 0 ]; then
>> 		exit 100
>> 	fi
>>
>> 	btrfs subvolume snapshot "$btrfsroot" "$btrfsrollback"
>> 	
>> }
>>
>> do_removerollback() {
>> 	if [ -d "$btrfsrollback" ]; then
>> 		btrfs subvolume delete "$btrfsrollback"
>> 	fi
>> }
>>
>> if [ "$1" = "snapshot" ]; then
>> 	do_snapshot
>> elif [ "$1" = "clean" ]; then
>> 	do_removerollback
>> else
>> 	echo "usage: $0  snapshot|clean"
>> fi
>> --------------------------------------------------------------
>>
>> Suggestions are welcome on how to detect automatically where the
>> btrfs root (subvolume=/) is mounted and my root subvolume name (debian in my
>> case), so I can avoid hard-coding them in my script.
> 
> You can figure out where "/" is within a btrfs filesystem by recursively
> looking up parent subvol IDs with TREE_SEARCH_V2 until you get to 5
> FS_ROOT (sort of like the way pwd works on traditional Unix); however,
> root can be a bind mount, so "path from fs_root to /" is not guaranteed
> to end at a subvol root.

Maybe a use case for a new ioctl :-) ? Snapshotting a subvolume without
mounting the root subvolume....

> 
> Also, sometimes people put /var on its own subvol, so you'd need to
> find "the set of all subvols relevant to dpkg" and that's definitely
> not trivial in the general case.

I know that as a general rule it is not easy. Anyway, I would also put /boot
and /home in a dedicated subvolume.
If the "rollback" is done at boot, /boot should be an invariant...
However I think that there are a lot of corner cases even here (what happens
if the boot kernel doesn't have its modules in the root subvolume?)

It is not an easy job. It must be performed at distribution level...

> 
> It's not as easy to figure out if there's an existing fs_root mount
> point (partly because namespacing mangles every path in /proc/mounts
> and mountinfo), but if you know the btrfs device (and can access it
> from your namespace) you can just mount it somewhere and then you do
> know where it is.

I agree: look up the "root device" starting from /, then mount the
root subvolume in a known place, where it is possible to snapshot
the root subvolume.

> 
>> BR
>> G.Baroncelli
>> -- 
>> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
>>


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD
  2020-04-06 16:43     ` Goffredo Baroncelli
@ 2020-04-06 17:21       ` Zygo Blaxell
  2020-04-06 17:33         ` Goffredo Baroncelli
  0 siblings, 1 reply; 17+ messages in thread
From: Zygo Blaxell @ 2020-04-06 17:21 UTC (permalink / raw)
  To: kreijack; +Cc: Graham Cobb, linux-btrfs

On Mon, Apr 06, 2020 at 06:43:04PM +0200, Goffredo Baroncelli wrote:
> On 4/6/20 4:24 AM, Zygo Blaxell wrote:
> > > > Of course btrfs is slower than ext4 when a lot of sync/flush are involved. Using
> > > > apt on a rotational was a dramatic experience. And IMHO  this should be replaced
> > > > by using the btrfs snapshot capabilities. But this is another (not easy) story.
> > flushoncommit and eatmydata work reasonably well...once you patch out the
> > noise warnings from fs-writeback.
> > 
> 
> You wrote flushoncommit, but did you mean "noflushoncommit" ?

No.  "noflushoncommit" means applications have to call fsync() all the
time, or their files get trashed on a crash.  I meant flushoncommit
and eatmydata.

While dpkg runs, it must never call fsync, or it breaks the write
ordering provided by flushoncommit (or you have to zero-log on boot).
btrfs effectively does a point-in-time snapshot at every commit interval.
dpkg's ordering of write operations and renames does the rest.

dpkg runs much faster, so the window for interruption is smaller, and
if it is interrupted, then the result is more or less the same as if
you had run with fsync() on noflushoncommit.  The difference is that
the filesystem might roll back to an earlier state after a crash, which
could be a problem e.g. if your maintainer scripts are manipulating data
on multiple filesystems.
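
Concretely, the combination I mean is just the btrfs flushoncommit mount
option plus the eatmydata wrapper from libeatmydata, roughly:

  mount -o remount,flushoncommit /
  eatmydata apt-get dist-upgrade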


> Regarding eatmydata, I used it too. However I was never happy. Below my script:
> ----------------------------------
> ghigo@venice:/etc/apt/apt.conf.d$ cat 10btrfs.conf
> 
> DPkg::Pre-Invoke {"bash /var/btrfs/btrfs-apt.sh snapshot";};
> DPkg::Post-Invoke {"bash /var/btrfs/btrfs-apt.sh clean";};
> Dpkg::options {"--force-unsafe-io";};
> ---------------------------------
> ghigo@venice:/etc/apt/apt.conf.d$ cat /var/btrfs/btrfs-apt.sh
> 
> btrfsroot=/var/btrfs/debian
> btrfsrollback=/var/btrfs/debian-rollback
> 
> 
> do_snapshot() {
> 	if [ -d "$btrfsrollback" ]; then
> 		btrfs subvolume delete "$btrfsrollback"
> 	fi
> 
> 	i=20
> 	while [ $i -gt 0 -a -d "$btrfsrollback" ]; do
> 		i=$(( $i - 1 ))
> 		sleep 0.1
> 	done
> 	if [ $i -eq 0 ]; then
> 		exit 100
> 	fi
> 
> 	btrfs subvolume snapshot "$btrfsroot" "$btrfsrollback"
> 	
> }
> 
> do_removerollback() {
> 	if [ -d "$btrfsrollback" ]; then
> 		btrfs subvolume delete "$btrfsrollback"
> 	fi
> }
> 
> if [ "$1" = "snapshot" ]; then
> 	do_snapshot
> elif [ "$1" = "clean" ]; then
> 	do_removerollback
> else
> 	echo "usage: $0  snapshot|clean"
> fi
> --------------------------------------------------------------
> 
> Suggestions are welcome on how to detect automatically where the
> btrfs root (subvolume=/) is mounted and my root subvolume name (debian in my
> case), so I can avoid hard-coding them in my script.

You can figure out where "/" is within a btrfs filesystem by recursively
looking up parent subvol IDs with TREE_SEARCH_V2 until you get to 5
FS_ROOT (sort of like the way pwd works on traditional Unix); however,
root can be a bind mount, so "path from fs_root to /" is not guaranteed
to end at a subvol root.

Also, sometimes people put /var on its own subvol, so you'd need to
find "the set of all subvols relevant to dpkg" and that's definitely
not trivial in the general case.

It's not as easy to figure out if there's an existing fs_root mount
point (partly because namespacing mangles every path in /proc/mounts
and mountinfo), but if you know the btrfs device (and can access it
from your namespace) you can just mount it somewhere and then you do
know where it is.
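
A rough sketch of that approach with plain btrfs-progs (the /var/btrfs mount
point is just an example):

  dev=$(findmnt -n -o SOURCE /)     # e.g. /dev/sda2[/debian] on btrfs
  dev=${dev%%\[*}                   # strip the [/subvol] suffix, if any
  mkdir -p /var/btrfs
  mount -o subvolid=5 "$dev" /var/btrfs          # mount the top-level subvolume
  rootid=$(btrfs inspect-internal rootid /)      # subvolume id backing "/"
  btrfs inspect-internal subvolid-resolve "$rootid" /var/btrfs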

> BR
> G.Baroncelli
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD
  2020-04-06  2:24   ` Zygo Blaxell
@ 2020-04-06 16:43     ` Goffredo Baroncelli
  2020-04-06 17:21       ` Zygo Blaxell
  0 siblings, 1 reply; 17+ messages in thread
From: Goffredo Baroncelli @ 2020-04-06 16:43 UTC (permalink / raw)
  To: Zygo Blaxell, Graham Cobb; +Cc: linux-btrfs

On 4/6/20 4:24 AM, Zygo Blaxell wrote:
>>> Of course btrfs is slower than ext4 when a lot of sync/flush are involved. Using
>>> apt on a rotational was a dramatic experience. And IMHO  this should be replaced
>>> by using the btrfs snapshot capabilities. But this is another (not easy) story.
> flushoncommit and eatmydata work reasonably well...once you patch out the
> noise warnings from fs-writeback.
> 

You wrote flushoncommit, but did you mean "noflushoncommit" ?

Regarding eatmydata, I used it too. However I was never happy. Below my script:
----------------------------------
ghigo@venice:/etc/apt/apt.conf.d$ cat 10btrfs.conf

DPkg::Pre-Invoke {"bash /var/btrfs/btrfs-apt.sh snapshot";};
DPkg::Post-Invoke {"bash /var/btrfs/btrfs-apt.sh clean";};
Dpkg::options {"--force-unsafe-io";};
---------------------------------
ghigo@venice:/etc/apt/apt.conf.d$ cat /var/btrfs/btrfs-apt.sh

btrfsroot=/var/btrfs/debian
btrfsrollback=/var/btrfs/debian-rollback


do_snapshot() {
	if [ -d "$btrfsrollback" ]; then
		btrfs subvolume delete "$btrfsrollback"
	fi

	i=20
	while [ $i -gt 0 -a -d "$btrfsrollback" ]; do
		i=$(( $i - 1 ))
		sleep 0.1
	done
	if [ $i -eq 0 ]; then
		exit 100
	fi

	btrfs subvolume snapshot "$btrfsroot" "$btrfsrollback"
	
}

do_removerollback() {
	if [ -d "$btrfsrollback" ]; then
		btrfs subvolume delete "$btrfsrollback"
	fi
}

if [ "$1" = "snapshot" ]; then
	do_snapshot
elif [ "$1" = "clean" ]; then
	do_removerollback
else
	echo "usage: $0  snapshot|clean"
fi
--------------------------------------------------------------
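
In case it is useful: after a failed upgrade the rollback snapshot could be
swapped in roughly like this, from a rescue environment (the device name and
the paths are only examples):

  mount -o subvolid=5 /dev/sdX /var/btrfs
  mv /var/btrfs/debian /var/btrfs/debian-broken
  mv /var/btrfs/debian-rollback /var/btrfs/debian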

Suggestions are welcome on how to detect automatically where the btrfs root (subvolume=/) is mounted and my root subvolume name (debian in my case), so I can avoid hard-coding them in my script.

BR
G.Baroncelli
-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD
  2020-04-05 10:57 ` Graham Cobb
  2020-04-05 18:47   ` Goffredo Baroncelli
@ 2020-04-06  2:24   ` Zygo Blaxell
  2020-04-06 16:43     ` Goffredo Baroncelli
  1 sibling, 1 reply; 17+ messages in thread
From: Zygo Blaxell @ 2020-04-06  2:24 UTC (permalink / raw)
  To: Graham Cobb; +Cc: Goffredo Baroncelli, linux-btrfs

On Sun, Apr 05, 2020 at 11:57:49AM +0100, Graham Cobb wrote:
> On 05/04/2020 09:26, Goffredo Baroncelli wrote:
> ...
> 
> > I considered the following scenarios:
> > - btrfs over ssd
> > - btrfs over ssd + hdd with my patch enabled
> > - btrfs over bcache over hdd+ssd
> > - btrfs over hdd (very, very slow....)
> > - ext4 over ssd
> > - ext4 over hdd
> > 
> > The test machine was an "AMD A6-6400K" with 4GB of ram, where 3GB was used
> > as cache/buff.
> > 
> > Data analysis:
> > 
> > Of course btrfs is slower than ext4 when a lot of sync/flush calls are involved.
> > Using apt on a rotational disk was a dramatic experience. IMHO this should be
> > replaced by using the btrfs snapshot capabilities. But this is another (not easy) story.

flushoncommit and eatmydata work reasonably well...once you patch out the
noise warnings from fs-writeback.

> > Unsurprisingly, bcache performs better than my patch. But this is an expected
> > result, because it can also cache the data chunks (reads can go directly to
> > the ssd). bcache is about +60% slower when there are a lot of sync/flush calls
> > and only +20% slower in the other case.
> > 
> > Regarding the test with force-unsafe-io (fewer sync/flush calls), my patch reduces
> > the overhead from the +256% of hdd-only to +113%, which I consider a good
> > result considering how small the patch is.
> > 
> > 
> > Raw data:
> > The data below is the "real" time (as returned by the time command) consumed by
> > apt
> > 
> > 
> > Test description         real (mmm:ss)	Delta %
> > --------------------     -------------  -------
> > btrfs hdd w/sync	   142:38	+533%
> > btrfs ssd+hdd w/sync        81:04	+260%
> > ext4 hdd w/sync	            52:39	+134%
> > btrfs bcache w/sync	    35:59	 +60%
> > btrfs ssd w/sync	    22:31	reference
> > ext4 ssd w/sync	            12:19	 -45%
> 
> Interesting data but it seems to be missing the case of btrfs ssd+hdd
> w/sync without your patch in order to tell what difference your patch
> made. Or am I confused?

Goffredo's test was using profile 'single' for both data and metadata,
so the unpatched allocator would use the biggest device (hdd) for all
block groups and ignore the smaller one (ssd).  The result should be
the same as plain btrfs hdd, give or take a few superblock updates.

Of course, no one should ever use 'single' profile for metadata, except
on disposable filesystems like the ones people use for benchmarks.  ;)
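
(For a real multi-device setup the usual choice would be something like

  mkfs.btrfs -m raid1 -d single /dev/sdX /dev/sdY

so that the metadata survives the loss of one device.)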

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD
  2020-04-05 18:47   ` Goffredo Baroncelli
@ 2020-04-05 21:58     ` Adam Borowski
  0 siblings, 0 replies; 17+ messages in thread
From: Adam Borowski @ 2020-04-05 21:58 UTC (permalink / raw)
  To: kreijack; +Cc: Graham Cobb, linux-btrfs

On Sun, Apr 05, 2020 at 08:47:15PM +0200, Goffredo Baroncelli wrote:
> Currently BTRFS allocates chunks on the basis of the free space.
> 
> For my tests I have a smaller ssd (20GB) and a bigger hdd (230GB).
> This means that the latter has higher priority for the allocation,
> until the free space becomes equal.
> 
> The rationale behind my patch is the following:
> - it is quite simple (even though in 3 iterations I put in two errors :-) )
> - BTRFS already has two kinds of information to store: data and metadata.
>   The former is (a lot) bigger than the latter. Having two kinds of storage,
>   one faster (and more expensive) than the other, it is natural to put the metadata
>   on the faster one, and the data on the slower one.

But why do you assume that SSD means fast?  Even with traditional disks
only, you can have a SATA-connected array for data and NVMe for metadata,
legacy NVMe for data and NVMe Optane for metadata -- but the real fun starts
if you put metadata on Optane pmem.

There are many storage tiers, and your patch hard-codes the lowest one as
the only determinant.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ in the beginning was the boot and root floppies and they were good.
⢿⡄⠘⠷⠚⠋⠀                                       -- <willmore> on #linux-sunxi
⠈⠳⣄⠀⠀⠀⠀

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD
  2020-04-05 10:57 ` Graham Cobb
@ 2020-04-05 18:47   ` Goffredo Baroncelli
  2020-04-05 21:58     ` Adam Borowski
  2020-04-06  2:24   ` Zygo Blaxell
  1 sibling, 1 reply; 17+ messages in thread
From: Goffredo Baroncelli @ 2020-04-05 18:47 UTC (permalink / raw)
  To: Graham Cobb, linux-btrfs

On 4/5/20 12:57 PM, Graham Cobb wrote:
> On 05/04/2020 09:26, Goffredo Baroncelli wrote:
[...]
>>
>>
>> Test description      real (mmm:ss)	Delta %
>> --------------------  -------------  -------
>> btrfs hdd w/sync	   142:38	+533%
>> btrfs ssd+hdd w/sync     81:04	+260%
>> ext4 hdd w/sync          52:39	+134%
>> btrfs bcache w/sync      35:59	 +60%
>> btrfs ssd w/sync         22:31	reference
>> ext4 ssd w/sync          12:19	 -45%
> 
> Interesting data but it seems to be missing the case of btrfs ssd+hdd
> w/sync without your patch in order to tell what difference your patch
> made. Or am I confused?
> 
Currently BTRFS allocates chunks on the basis of the free space.

For my tests I have a smaller ssd (20GB) and a bigger hdd (230GB).
This means that the latter has higher priority for the allocation,
until the free space becomes equal.

The rationale behind my patch is the following:
- it is quite simple (even though in 3 iterations I put in two errors :-) )
- BTRFS already has two kinds of information to store: data and metadata.
   The former is (a lot) bigger than the latter. Having two kinds of storage,
   one faster (and more expensive) than the other, it is natural to put the metadata
   on the faster one, and the data on the slower one.
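
To see where the chunks actually end up on each device:

  btrfs device usage /mnt/test

With the patch and ssd_metadata, the Metadata/System chunks should show up on
the 20GB ssd and the Data chunks on the 230GB hdd (the mount point here is
just an example).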

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD
  2020-04-05  8:26 Goffredo Baroncelli
@ 2020-04-05 10:57 ` Graham Cobb
  2020-04-05 18:47   ` Goffredo Baroncelli
  2020-04-06  2:24   ` Zygo Blaxell
  2020-05-29 16:06 ` Hans van Kranenburg
  2020-05-30  4:59 ` Qu Wenruo
  2 siblings, 2 replies; 17+ messages in thread
From: Graham Cobb @ 2020-04-05 10:57 UTC (permalink / raw)
  To: Goffredo Baroncelli, linux-btrfs

On 05/04/2020 09:26, Goffredo Baroncelli wrote:
...

> I considered the following scenarios:
> - btrfs over ssd
> - btrfs over ssd + hdd with my patch enabled
> - btrfs over bcache over hdd+ssd
> - btrfs over hdd (very, very slow....)
> - ext4 over ssd
> - ext4 over hdd
> 
> The test machine was an "AMD A6-6400K" with 4GB of ram, where 3GB was used
> as cache/buff.
> 
> Data analysis:
> 
> Of course btrfs is slower than ext4 when a lot of sync/flush calls are involved.
> Using apt on a rotational disk was a dramatic experience. IMHO this should be
> replaced by using the btrfs snapshot capabilities. But this is another (not easy) story.
> 
> Unsurprisingly, bcache performs better than my patch. But this is an expected
> result, because it can also cache the data chunks (reads can go directly to
> the ssd). bcache is about +60% slower when there are a lot of sync/flush calls
> and only +20% slower in the other case.
> 
> Regarding the test with force-unsafe-io (fewer sync/flush calls), my patch reduces
> the overhead from the +256% of hdd-only to +113%, which I consider a good
> result considering how small the patch is.
> 
> 
> Raw data:
> The data below is the "real" time (as returned by the time command) consumed by
> apt
> 
> 
> Test description         real (mmm:ss)	Delta %
> --------------------     -------------  -------
> btrfs hdd w/sync	   142:38	+533%
> btrfs ssd+hdd w/sync        81:04	+260%
> ext4 hdd w/sync	            52:39	+134%
> btrfs bcache w/sync	    35:59	 +60%
> btrfs ssd w/sync	    22:31	reference
> ext4 ssd w/sync	            12:19	 -45%

Interesting data but it seems to be missing the case of btrfs ssd+hdd
w/sync without your patch in order to tell what difference your patch
made. Or am I confused?


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD
@ 2020-04-05  8:26 Goffredo Baroncelli
  2020-04-05 10:57 ` Graham Cobb
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Goffredo Baroncelli @ 2020-04-05  8:26 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Michael, Hugo Mills, Martin Svec, Wang Yugui


Hi all,

This is an RFC; I wrote this patch because I find the idea interesting
even though it adds more complication to the chunk allocator.

The core idea is to store the metadata on the ssd and to leave the data
on the rotational disks. BTRFS looks at the rotational flags to
understand the kind of disks.

This new mode is enabled passing the option ssd_metadata at mount time.
This policy of allocation is the "preferred" one. If this doesn't permit
a chunk allocation, the "classic" one is used.

Some examples: (/dev/sd[abc] are ssd, and /dev/sd[ef] are rotational)

Non striped profile: metadata->raid1, data->raid1
The data is stored on /dev/sd[ef], metadata is stored on /dev/sd[abc].
When /dev/sd[ef] are full, then the data chunk is allocated also on
/dev/sd[abc].

Striped profile: metadata->raid6, data->raid6
raid6 requires 3 disks at minimum, so /dev/sd[ef] are not enough for a
data profile raid6. To allow a data chunk allocation, the data profile raid6
will be stored on all the disks /dev/sd[abcdef].
Instead the metadata profile raid6 will be allocated on /dev/sd[abc],
because these are enough to host this chunk.
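
As a quick sanity check, the rotational flag that the allocator looks at can be
read from sysfs, and the new mode is just a mount option (available only with
this patch applied):

  # 0 = ssd, 1 = rotational
  for d in sda sdb sdc sde sdf; do
        printf '%s: ' "$d"; cat /sys/block/$d/queue/rotational
  done

  # enable the preferred-metadata policy (this patch only)
  mount -o ssd_metadata /dev/sda /mnt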

Changelog:
v1: - first issue
v2: - rebased to v5.6.2
    - correct the comparison about the rotational disks (>= instead of >)
    - add the flag rotational to the struct btrfs_device_info to
      simplify the comparison function (btrfs_cmp_device_info*() )
v3: - correct the collision between BTRFS_MOUNT_DISCARD_ASYNC and
      BTRFS_MOUNT_SSD_METADATA.

Below I collected some data to highlight the performance increment.

Test setup:
As a test I performed a "dist-upgrade" of a Debian from stretch to buster.
The test consisted of an image of a Debian stretch[1] with the packages
needed under /var/cache/apt/archives/ (so no networking was involved).
For each test I formatted the filesystem from scratch, un-tarred the
image and then ran "apt-get dist-upgrade" [2]. For each disk(s)/filesystem
combination I measured the time of apt dist-upgrade with and
without the flag "force-unsafe-io", which reduces the use of sync(2) and
flush(2). The ssd was 20GB, the hdd was 230GB.

I considered the following scenarios:
- btrfs over ssd
- btrfs over ssd + hdd with my patch enabled
- btrfs over bcache over hdd+ssd
- btrfs over hdd (very, very slow....)
- ext4 over ssd
- ext4 over hdd

The test machine was an "AMD A6-6400K" with 4GB of ram, where 3GB was used
as cache/buff.

Data analysis:

Of course btrfs is slower than ext4 when a lot of sync/flush calls are involved.
Using apt on a rotational disk was a dramatic experience. IMHO this should be
replaced by using the btrfs snapshot capabilities. But this is another (not easy) story.

Unsurprisingly, bcache performs better than my patch. But this is an expected
result, because it can also cache the data chunks (reads can go directly to
the ssd). bcache is about +60% slower when there are a lot of sync/flush calls
and only +20% slower in the other case.

Regarding the test with force-unsafe-io (fewer sync/flush calls), my patch reduces
the overhead from the +256% of hdd-only to +113%, which I consider a good
result considering how small the patch is.


Raw data:
The data below is the "real" time (as returned by the time command) consumed by
apt


Test description         real (mmm:ss)	Delta %
--------------------     -------------  -------
btrfs hdd w/sync	   142:38	+533%
btrfs ssd+hdd w/sync        81:04	+260%
ext4 hdd w/sync	            52:39	+134%
btrfs bcache w/sync	    35:59	 +60%
btrfs ssd w/sync	    22:31	reference
ext4 ssd w/sync	            12:19	 -45%



Test description         real (mmm:ss)	Delta %
--------------------     -------------  -------
btrfs hdd	             56:2	+256%
ext4 hdd	            51:32	+228%
btrfs ssd+hdd	            33:30	+113%
btrfs bcache	            18:57	 +20%
btrfs ssd	            15:44	reference
ext4 ssd	            11:49	 -25%


[1] I created the image using "debootstrap stretch", then I installed a set
of packages using the commands:

  # debootstrap stretch test/
  # chroot test/
  # mount -t proc proc proc
  # mount -t sysfs sys sys
  # apt --option=Dpkg::Options::=--force-confold \
        --option=Dpkg::options::=--force-unsafe-io \
	install mate-desktop-environment* xserver-xorg vim \
        task-kde-desktop task-gnome-desktop

Then I updated the release from stretch to buster by changing the file /etc/apt/sources.list.
Then I downloaded the packages for the dist-upgrade:

  # apt-get update
  # apt-get --download-only dist-upgrade

Then I created a tar of this image.
Before the dist-upgrade the space used was about 7GB with 2281
packages. After the dist-upgrade, the space used was 9GB with 2870 packages.
The upgrade installed/updated about 2251 packages.


[2] The command was a bit more complex, to avoid an interactive session

  # mkfs.btrfs -m single -d single /dev/sdX
  # mount /dev/sdX test/
  # cd test
  # time tar xzf ../image.tgz
  # chroot .
  # mount -t proc proc proc
  # mount -t sysfs sys sys
  # export DEBIAN_FRONTEND=noninteractive
  # time apt-get -y --option=Dpkg::Options::=--force-confold \
	--option=Dpkg::options::=--force-unsafe-io dist-upgrade


BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5




^ permalink raw reply	[flat|nested] 17+ messages in thread
